Get information from an HTML page

Hello Guys.

I have an HTML page stored in a MySQL database field. I use htmlentities to display the pure text with no markup.
I would like to search for some text on this page. I guess I can use an array of the words I am searching for and
check whether they can be found, but the problem is that I need to find these words in the order they appear, and
the order will not be the same for every HTML page.

Any help, guys?

Thanks
Kim

One way to do that in PHP is to use preg_match_all with a regular expression. However, as the number of pages grows, performance will degrade. If that happens, you will need to move the process to a background job or cron task so that the front end of the website does not slow down. It really all depends on how many pages you’re talking about. Also, if these pages are stored in the database, they had to come from somewhere. Perhaps it makes more sense to cache the word matches right after saving the page content to the database. You could create a separate table to store each page’s relevant search metadata.
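As a rough sketch of the preg_match_all idea: the PREG_OFFSET_CAPTURE flag returns each match together with its position, and matches come back in the order they occur in the text. The function name findWordsInOrder and the sample text below are made up for illustration, and the sketch assumes the markup has already been stripped from the page.

```php
<?php
// Hypothetical helper: returns the search words found in $text,
// in the order they appear, each with its byte offset.
function findWordsInOrder(string $text, array $words): array
{
    if ($words === []) {
        return [];
    }

    // Escape each word so it is matched literally, then join as alternatives.
    $escaped = array_map(function (string $w): string {
        return preg_quote($w, '/');
    }, $words);
    $pattern = '/\b(?:' . implode('|', $escaped) . ')\b/i';

    // PREG_OFFSET_CAPTURE turns every match into a [text, offset] pair.
    preg_match_all($pattern, $text, $matches, PREG_OFFSET_CAPTURE);

    $found = [];
    foreach ($matches[0] as [$word, $offset]) {
        $found[] = ['word' => strtolower($word), 'offset' => $offset];
    }
    return $found;
}

// Sample text (assumed): the search list is deliberately out of order.
$text = 'Breakfast Toast eggs bacon Lunch salad rolls';
$hits = findWordsInOrder($text, ['lunch', 'eggs', 'breakfast', 'salad']);
```

Here array_column($hits, 'word') lists the words in page order rather than in the order of the search array, which is the behaviour asked for. The result could then be stored in the separate search table suggested above instead of being recomputed on every request.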

Thanks for the reply. How would I use a regexp here? I can’t seem to get my head around it. As an example, the page may display something like this:

Breakfast, which is a heading, and then below it:
Toast
eggs
bacon
coffee
tea

Then we will have Lunch as another heading, and below it:
salad
rolls
curry
burgers
sausage
cheese

My problem is that I need to find the main heading and then list the items below it, and do the same for lunch. The challenge is that an item that appears under lunch on one page could be under breakfast on another, which is why I need to extract the information in the order it appears: something like a magnifying glass reading as it goes along and, when it finds something, storing it in an array or something.

I would appreciate any help.

Thanks
Kim

It sounds like the problem is how you are storing data.
Why is it HTML?
It would be better to store the data in multiple related tables so that it is more searchable; the same tables can then be queried to pull out the data and render the HTML pages.

That is what is known as a “many-to-many” relationship, so it requires a look-up table that matches the key for an item with the key for a meal.
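To illustrate the look-up table idea, here is a small sketch that uses plain PHP arrays as stand-ins for three database tables. All names ($meals, $items, $mealItems, itemsForMeal) are invented for the example; in a real schema these would be tables, and the look-up table could also carry a page key if the pairings differ per page.

```php
<?php
// Stand-ins for two data tables (id => name). Names are hypothetical.
$meals = [1 => 'Breakfast', 2 => 'Lunch'];
$items = [1 => 'Toast', 2 => 'eggs', 3 => 'salad'];

// Look-up table: one row per (meal, item) pairing.
// The same item may be linked to more than one meal.
$mealItems = [
    ['meal_id' => 1, 'item_id' => 1],
    ['meal_id' => 1, 'item_id' => 2],
    ['meal_id' => 2, 'item_id' => 3],
    ['meal_id' => 2, 'item_id' => 2], // eggs also served at lunch
];

// Returns the item names linked to one meal, in look-up table order.
function itemsForMeal(int $mealId, array $items, array $mealItems): array
{
    $names = [];
    foreach ($mealItems as $row) {
        if ($row['meal_id'] === $mealId) {
            $names[] = $items[$row['item_id']];
        }
    }
    return $names;
}

$lunch = itemsForMeal(2, $items, $mealItems);
```

With real tables the loop would be a JOIN across the three tables, but the shape of the data is the same: items and meals are stored once, and only the pairings change.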

You could use the Symfony DomCrawler to crawl the HTML instead of using regex. An alternative to DomCrawler would be QueryPath. A nice thing about both is that they support a jQuery-like CSS selector syntax for selecting nodes and attributes in HTML/XML.
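For a sketch of the DOM approach that runs without installing anything, the example below uses PHP's built-in DOMDocument and DOMXPath rather than DomCrawler or QueryPath. It assumes, purely for illustration, that each heading is an h2 followed by a list of li items; the real markup in the stored pages may differ, so the XPath expression would need adjusting. The function name itemsByHeading is made up.

```php
<?php
// Groups list items under the heading that precedes them.
// Assumed markup: <h2> headings, each followed by <li> items.
function itemsByHeading(string $html): array
{
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of printing them,
    // since stored pages are rarely perfectly valid HTML.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $result = [];
    $current = null;

    // The union query yields headings and items in document order.
    foreach ($xpath->query('//h2 | //li') as $node) {
        $text = trim($node->textContent);
        if ($node->nodeName === 'h2') {
            $current = $text;          // start a new group
            $result[$current] = [];
        } elseif ($current !== null) {
            $result[$current][] = $text; // item belongs to the last heading seen
        }
    }
    return $result;
}

// Sample page fragment (assumed structure).
$html = '<h2>Breakfast</h2><ul><li>Toast</li><li>eggs</li></ul>'
      . '<h2>Lunch</h2><ul><li>salad</li><li>curry</li></ul>';
$menu = itemsByHeading($html);
```

Because the nodes are walked in page order, an item ends up under whichever heading precedes it on that particular page, so the same item can fall under Breakfast on one page and Lunch on another.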
