XML to database - any ideas or pointers?

I am facing a challenge whereby an XML file's content has to be stored in a table, with some computation to be done between the reading and the storing.

This table has to be refreshed each week from the latest XML source; its size is ~1 MB.

There are fewer than 1000 'rows' in the XML file, the UID of each record being the title of an organisation.

The best guess is that there would be, say, 5 diffs between last week's XML file and this week's. For argument's sake, let's say that's 1 new entry, 1 entry deleted, and 3 entries whose contents have somehow changed.

Essentially I have been asked to delete the old data and re-insert the new - which seems wasteful, but quite easily doable…

… until you consider that one of the differences between the XML file and the database table has to be computed as follows:


<record>
<title>My Org</title>
<address>1 Big Street</address>
<town>Rochdale</town>
<postcode>BL9 8PP</postcode>
.... and so on ...
</record>

stored as:


organisations
==========
id
title
address
town 
postcode
lat_lng **

** so here I have to curl out to Google Maps, do an address lookup, and grab the lat,lng. From experience this goes wrong ~1-2% of the time.

So now it's gotten slightly more complicated, because there are latency issues, limits on Google API use, and so on.
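For reference, the lookup itself is roughly this - a sketch against the Google Geocoding web service that returns null on the ~1-2% of failures so the caller can queue a retry rather than store bad data:

<?php
// Sketch of the address -> lat,lng lookup via the Google Geocoding
// web service. Returns array(lat, lng), or null when the lookup fails
// (ZERO_RESULTS, OVER_QUERY_LIMIT, network timeout).
function geocode($address)
{
  $url = 'http://maps.googleapis.com/maps/api/geocode/json?sensor=false&address='
       . urlencode($address);

  $ch = curl_init($url);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
  curl_setopt($ch, CURLOPT_TIMEOUT, 10);
  $body = curl_exec($ch);
  curl_close($ch);

  if ($body === false) {
    return null; // network / latency problem
  }

  $json = json_decode($body, true);
  if ($json === null || $json['status'] !== 'OK') {
    return null; // e.g. ZERO_RESULTS or OVER_QUERY_LIMIT
  }

  $loc = $json['results'][0]['geometry']['location'];
  return array($loc['lat'], $loc['lng']);
}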

There are some other complexities which I am happy to describe, but the above describes the nub of the problem.

Thanks for reading this far; perhaps you can give me some thoughts on my dilemmas, which include:

Should I check for diffs between this week's XML and last week's XML and just process those?

  -If I do take that route, what are the best tools for the job? (either PHP-based or *nix tools)

  -As this is a weekly event run overnight, not serving a global audience, perhaps I could load both XML files into memory using SimpleXML?

  -Perhaps I could split the XML feed into records, store them as raw text in a weekly table, and see if an update changes anything? Would that work as a means of identifying diffs? (See the sketch after this list.)
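To make those last two ideas concrete, I'm picturing something along these lines - a sketch assuming both files fit in memory, a root element wrapping the <record> nodes, and illustrative file names:

<?php
// Sketch: index each week's records by <title> and hash each record's
// raw XML, so adds / deletes / changes fall out of array operations.
function indexRecords($file)
{
  $index = array();
  foreach (simplexml_load_file($file)->record as $record) {
    $index[(string) $record->title] = sha1($record->asXML());
  }
  return $index;
}

$old = indexRecords('last_week.xml');   // illustrative file names
$new = indexRecords('this_week.xml');

$added   = array_diff_key($new, $old);  // in new feed only
$deleted = array_diff_key($old, $new);  // in old feed only
$changed = array();
foreach (array_intersect_key($new, $old) as $title => $hash) {
  if ($old[$title] !== $hash) {
    $changed[$title] = $hash;           // same title, different content
  }
}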

Should I just stick to the spec, wipe everything out, and start afresh every week?

Is there another way of looking at this problem, or have you perhaps had to deal with something similar?

I really want a solution that I can unit test, because the feeling is that the XML <-> table mapping will evolve and change over time, and the potential for this system to break down at any one of a number of flex points seems very high.

For that reason, I'd be happy to hear your recommendations about which patterns might provide me with easy-to-manage solutions for the future.

Cheers!

(PHP 5.3.2, Ubuntu 4.9)

Regarding the storage of the data, could you not add a DATE column (defaulting to NULL) indicating when a record was “deleted”? With this column you could filter out only active records, and then after a predetermined period remove them completely.

If the record is re-inserted (caught by the duplicate key), you can just set that column back to NULL.
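A minimal sketch of that, assuming MySQL with a UNIQUE key on title ($pdo and $record are taken as given):

<?php
// Upsert: a re-appearing record clears deleted_on and so is "revived".
$stmt = $pdo->prepare(
  'INSERT INTO organisations (title, address, town, postcode)
   VALUES (:title, :address, :town, :postcode)
   ON DUPLICATE KEY UPDATE
     address = VALUES(address),
     town = VALUES(town),
     postcode = VALUES(postcode),
     deleted_on = NULL'
);
$stmt->execute(array(
  ':title'    => (string) $record->title,
  ':address'  => (string) $record->address,
  ':town'     => (string) $record->town,
  ':postcode' => (string) $record->postcode,
));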

I've been thinking about this exact problem this week and came up with the following pseudocode (it's very crude, but I think the idea is clear; if not, just holler):

ids = array()

foreach (nodes as node)
{
	ids.push(node.id)
	node.sha1 = sha1(node)
	if (db.query("SELECT COUNT(*) FROM myTable WHERE sha1 = '" + node.sha1 + "'") == 0)
	{
		// node is not in the database, so it's either new or an existing row has changed
		preprocess(node)
		if (oldnode = db.query("SELECT field1, field2, etc FROM myTable WHERE id = '" + node.id + "'"))
		{
			// existing node changed
			db.delete(oldnode)
			db.insert(node)
			// or calculate the difference between oldnode and node and UPDATE --
			// beneficial if the lat and lng didn't change,
			// because it would save you a call to the gmaps api
		}
		else
		{
			// new node
			db.insert(node)
		}
	}
	// else node exists unchanged, nothing to be done here
}
// anything still in the table but absent from this week's feed was removed
rows = db.query('SELECT id FROM myTable WHERE id NOT IN (' + implode(',', ids) + ')')
foreach (rows as row)
{
	db.delete(row)
}

I haven’t tested this yet, but I think it should hold. That is, I can’t think of a situation where it wouldn’t work.
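In PHP with PDO that might come out roughly like so (equally untested; preprocess() and insertNode() are placeholder helpers, and $nodes is the SimpleXML record list):

<?php
// Untested PDO sketch of the pseudocode above; the record's title
// doubles as its id, per the original spec.
$ids = array();

foreach ($nodes as $node) {
  $id    = (string) $node->title;
  $sha1  = sha1($node->asXML());
  $ids[] = $id;

  $check = $pdo->prepare('SELECT COUNT(*) FROM myTable WHERE sha1 = ?');
  $check->execute(array($sha1));

  if ((int) $check->fetchColumn() === 0) {
    // not in the database: either brand new or an existing row changed
    preprocess($node); // placeholder, e.g. the lat/lng lookup

    $old = $pdo->prepare('SELECT id FROM myTable WHERE id = ?');
    $old->execute(array($id));

    if ($old->fetchColumn() !== false) {
      // existing row changed: replace it (or UPDATE only the diff,
      // which saves the gmaps call when the address is unchanged)
      $pdo->prepare('DELETE FROM myTable WHERE id = ?')->execute(array($id));
    }
    insertNode($pdo, $node, $sha1); // placeholder doing the INSERT
  }
  // else: identical row already stored, nothing to do
}

// anything still in the table but absent from this week's feed was removed
if (count($ids) > 0) {
  $in = implode(',', array_fill(0, count($ids), '?'));
  $pdo->prepare("DELETE FROM myTable WHERE id NOT IN ($in)")->execute($ids);
}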

I was thinking something like…


<?php
class MapperException extends Exception {}

class XmlEntityMapper
{
  private $db;

  // expects a PDO connection configured with PDO::ERRMODE_EXCEPTION,
  // so that database errors surface as PDOExceptions
  public function __construct(PDO $db){
    $this->db = $db;
  }

  public function insert(SimpleXMLElement $record){
    try{
      $stmt = $this->db->prepare(
        'INSERT INTO organisations (title, postcode) VALUES (?, ?)'
      );
      return $stmt->execute(array((string) $record->title, (string) $record->postcode));
    }catch(PDOException $exception){
      // SQLSTATE 23000 = integrity constraint violation, i.e. duplicate title
      if($exception->getCode() == '23000'){
        return $this->update($record);
      }
      throw new MapperException($exception->getMessage());
    }
  }

  public function update(SimpleXMLElement $record){
    // on the duplicate-key path deleted_on is reset to NULL,
    // which conveniently "revives" a soft-deleted record
    $stmt = $this->db->prepare(
      'UPDATE organisations SET postcode = ?, deleted_on = ? WHERE title = ? LIMIT 1'
    );
    return $stmt->execute(array(
      (string) $record->postcode,
      isset($record->deleted_on) ? (string) $record->deleted_on : null,
      (string) $record->title
    ));
  }

  public function delete(SimpleXMLElement $record){
    $record->deleted_on = date(DATE_ISO8601);
    return $this->update($record);
  }
}
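Hypothetical usage, feeding this week's records through the mapper (connection details are made up):

<?php
$pdo = new PDO('mysql:host=localhost;dbname=orgs', 'user', 'pass');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$mapper = new XmlEntityMapper($pdo);

foreach (simplexml_load_file('this_week.xml')->record as $record) {
  $mapper->insert($record); // falls through to update() on a duplicate title
}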

Thank you both for your ideas.

I'll try to mini-model this a bit further; there are other complexities which might sway things - these might be better explained in code.

Thus far I'm just checking strategies (in my head) - more fun than doing the Sunday crossword - and it helps a lot just seeing the problem in writing.

Try Yahoo PlaceFinder instead; their API has a 50K-per-day limit and did not have such restrictions last time I checked.