I am facing a challenge whereby an xml file’s content has to be stored in a table, there is some computing to be done between the reading and the storing.
This table has to be refreshed each week from the latest xml source its size is ~1Meg.
There are less than 1000 ‘rows’ in the xml file, the UID of each record being the title of an organisation.
The best guess is that there would be say, 5 diffs between last weeks xml file and this weeks. For argument sake lets say that is 1 new entry, 1 entry deleted and 3 entries whose contents will have somehow changed.
Essentially I have been asked to delete the old data, and re-insert the new - which seems wasteful, but quite easily do-able…
… until you consider that one of the differences between the xml file and the database table has to be computed as follows:
<record>
<title>My Org</title>
<address>1 Big Street</address>
<town>Rochdale</postcode>
<postcode>BL9 8PP</postcode>
.... and so on ...
</record>
stored as:
organisations
==========
id
title
address
town
postcode
lat_lng **
** so here I have to curl out to google maps and do an address lookup and grab the lat,lng. From experience this goes wrong ~1 to 2% of the time,
So now its gotten slightly more complicated, because there are latency issues and limits on google api use and so on.
There are some other complexities which I am happy to describe, but the above describes the nub of the problem.
Thanks for reading this far but perhaps you can give some thoughts on my dilemmas, which include.
Should I check for diffs between this weeks xml and last weeks xml and just process them?
-If I do take that route, what are the best tools to use for that job? (either PHP-based or *nix tools)
-As this is a weekly event run overnight not serving a global audience, perhaps I could load both xml files in memory using simplexml?
-Perhaps I could split the xml feed into records and store them in as raw text in a weekly table and see if an update changes anything? Would that work as a means of identifying diffs?
Should I just stick to the spec and wipe out everything and start afresh every week?
Is there another way of looking at this problem, or perhaps you have had to deal with similar?
I really want a solution that I can unit test because they feel that the xml <-> table mapping will evolve and change over time, and the potential for this system to break down and one of any number of flex points seems very high.
For that reason, I be happy to hear your recommendation about which Patterns might provide me with easy to manage solutions for the future.
Cheers!
(PHP 5.3.2, Ubuntu 4.9)