Opinions on the best way to handle large XML data feed

I have an XML feed containing about 2000 properties, each with about 10 nodes. The feed is updated at least once every 24 hours.

I need to get the contents of this feed onto the page in the same format as the properties currently retrieved from the local DB. It must also be possible to search for various terms and to paginate the results, as with the existing content.

As I see it I have two possible methods of doing this:

  1. Parse the XML feed and insert the contents into a database, then run an update every few hours to keep the data fresh.

  2. Use the feed to add the content to the page directly as required.

The first option seems rather intensive at first glance, but at least all the design, pagination and so on is already in place. The weak points would be the DB and whatever is used to run the updates.

The second option may have limitations due to the size of the XML feed. I have some limited experience of using SimpleXML to parse feeds, and I suspect the size of this feed may well slow things to a crawl. I believe I should consider something like XMLReader, which I am not familiar with, so there would be some learning to do.

Of course there may well be other options that I am unaware of, and that is where you guys come into play. How would you proceed with this?

Thanks in advance

Colin

SimpleXML would be a good fit for parsing the XML data, since you have no need to write XML. It’s an easy tool for the job.
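For example, a minimal SimpleXML sketch; the feed URL and the `<property>` element name are assumptions, so substitute the real structure of the feed:

```php
<?php
// Minimal SimpleXML sketch. The feed URL and the <property> element
// name are assumptions; substitute the real structure of your feed.
$xml = simplexml_load_file('http://example.com/feed.xml');

if ($xml === false) {
    die('Could not load the feed');
}

foreach ($xml->property as $property) {
    // Each child node is available as an object property
    echo $property->title, "\n";
}
```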

Loading the XML data each time and holding it all in memory could cause you problems. Your script has less memory to work with and may hit the limit in your php.ini settings. Increasing the memory limit would get around this, but it’s not ideal, as it puts unnecessary additional strain on your server. Of course, I don’t know how big that XML file is, so it may not be a huge issue. If you have a decent number of users, though, it’s best to avoid this method.
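It’s easy enough to measure before committing to the in-memory approach; a quick check along these lines, using a local `feed.xml` as a stand-in:

```php
<?php
// Rough check of how much memory the parsed feed consumes, compared
// against the php.ini limit, before deciding on in-memory parsing.
$before = memory_get_usage();
$xml    = simplexml_load_file('feed.xml');
$after  = memory_get_usage();

echo 'Feed uses about ', round(($after - $before) / 1048576, 2), " MB\n";
echo 'memory_limit: ', ini_get('memory_limit'), "\n";
```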

Depending on the access you have to your server (if you’re on shared hosting, you might be SOL), I’d suggest a cron job for checking the XML file regularly. Ideally you want to cache the XML locally, both for performance (don’t fetch a new copy and update the DB if the cache is up to date) and to stop your site falling over if the XML is missing for whatever reason.
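Something along these lines for the caching side; the URL, cache path and six-hour window are all placeholders to tune to the feed’s actual update cycle:

```php
<?php
// Local cache sketch. $feedUrl, the cache path and the six-hour
// window are assumptions; adjust them to the real feed.
$feedUrl   = 'http://example.com/feed.xml';
$cacheFile = '/path/to/cache/feed.xml';
$maxAge    = 6 * 3600;

if (!file_exists($cacheFile) || time() - filemtime($cacheFile) > $maxAge) {
    $fresh = @file_get_contents($feedUrl);

    // Only overwrite the cache when the fetch actually succeeded, so
    // a missing feed leaves the last good copy in place
    if ($fresh !== false && $fresh !== '') {
        file_put_contents($cacheFile, $fresh);
    }
}
```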

For handling the data, you definitely want a DB for performance. It’ll make pagination very simple (if you’re using MySQL, use LIMIT for this), make search a doddle, and it’ll be a lot easier to do analysis, should you choose to.
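For instance, a pagination sketch using LIMIT via PDO; the `$pdo` connection and the `properties` table and column names are assumptions:

```php
<?php
// Pagination sketch using MySQL's LIMIT. The $pdo connection and the
// `properties` table/column names are assumptions.
$perPage = 20;
$page    = isset($_GET['page']) ? max(1, (int) $_GET['page']) : 1;
$offset  = ($page - 1) * $perPage;

// Both values are cast to int above, so interpolating them is safe
$sql = sprintf(
    'SELECT * FROM properties ORDER BY updated_at DESC LIMIT %d OFFSET %d',
    $perPage,
    $offset
);

foreach ($pdo->query($sql) as $row) {
    echo $row['title'], "\n";
}
```

A separate `SELECT COUNT(*)` gives you the total for building the page links.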

Thanks for your reply. The more I think about this, the better the DB option looks.

Each property has a datestamp, so it would be good to check this node and only insert or update records changed since the last check. As I understand it, this could not be done efficiently with SimpleXML, because it loads the entire feed into memory before parsing.
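Roughly what the XMLReader approach might look like; the `<property>` and `<datestamp>` element names are taken from this thread, and the cache path is a placeholder:

```php
<?php
// Streaming sketch with XMLReader: memory use stays flat no matter
// how large the feed grows. Element names and the cache path are
// assumptions; adjust to the real feed.
$lastRun = '2010-07-22 10:25:28'; // e.g. read from a settings table

$reader = new XMLReader();
$reader->open('/path/to/cache/feed.xml');

// Move to the first <property> element
while ($reader->read() && $reader->name !== 'property');

while ($reader->name === 'property') {
    // Expand just this one property into a DOM node, then SimpleXML
    $dom      = new DOMDocument();
    $property = simplexml_import_dom($dom->importNode($reader->expand(), true));

    // Datestamps in YYYY-MM-DD HH:MM:SS format compare correctly as strings
    if ((string) $property->datestamp > $lastRun) {
        // insert or update this record in the DB
    }

    $reader->next('property'); // jump straight to the next sibling
}

$reader->close();
```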

The XML feed will continue to grow in size, so I need to be certain that whatever method I run with can cope.

Keep the suggestions coming please.

C

Using cron:

  • Create a job to download/update the XML data.
  • Create a job to import the XML data into your database.

The frequency would obviously depend on how often the XML data is subject to change.
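For example, a pair of crontab entries along these lines; the PHP binary location and script paths are hypothetical:

```
# Fetch a fresh copy of the feed at 02:00, then import it at 02:15
0  2 * * * /usr/bin/php /path/to/fetch_feed.php
15 2 * * * /usr/bin/php /path/to/import_feed.php
```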

Hi Anthony

That is pretty much the direction I was heading in, but thanks for confirming that it is the best solution.

Best get coding then!

regards

Colin

Hi

Each property has a date column in the DB. Is it possible to structure my query so that it only updates rows with a different date? The date format is 2010-07-22 10:25:28.
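One common way to handle this in MySQL is `INSERT ... ON DUPLICATE KEY UPDATE`, which only rewrites a row when the incoming values actually differ from the stored ones; the `feed_id` key and the other column names below are illustrative:

```sql
-- Assumes a UNIQUE key on feed_id. MySQL skips the write entirely
-- when the new values match the existing row.
INSERT INTO properties (feed_id, title, price, updated_at)
VALUES (123, 'Example property', 100000, '2010-07-22 10:25:28')
ON DUPLICATE KEY UPDATE
    title      = VALUES(title),
    price      = VALUES(price),
    updated_at = VALUES(updated_at);
```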

But then, if a property is deleted from the feed, it would not be removed from the DB with this method, unless of course I somehow checked for its existence in the feed and deleted it if it was missing.
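A sketch of that check-and-delete idea, using a hypothetical `missing` flag column alongside the `feed_id` key:

```sql
-- Before the import: assume everything has vanished from the feed
UPDATE properties SET missing = 1;

-- During the import: each property found in the feed clears its flag
-- (the upsert above would include `missing = 0` in its column list)
UPDATE properties SET missing = 0 WHERE feed_id = 123;

-- After the import: whatever is still flagged was not in the feed
DELETE FROM properties WHERE missing = 1;
```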

Perhaps it would be best to just overwrite the DB every time?

Thoughts on this? :confused:

Colin