Parsing a large (1+GB) nested XML file

MainArea · December 27, 2010, 6:35am

I have a 1+GB XML file that I need to parse into a database. The XML file has many nested elements - I’ve posted a sample of the XML file at http://pastebin.com/7Wzzaxg1 (but this XML file continues for hundreds of thousands of rows).

What I need to do is to export all of this data from the XML file into a database. I’m trying to figure out how to process segments to export - for example, on one run-through, I’ll want to extract the child elements of Album (id, name, sample url, upc, artist ID, label id, category id). On another run-through, I’ll want to grab all Artist data (id, name, url). On yet another run-through, I’ll need the “data” element in addition to the album ID to which it belongs (Album ID, Data ID, Data Name, Data Sample URL).

Unfortunately, since this file is so huge, I’m unable to use SimpleXML parsing - I’m forced to use XMLReader (which streams an input file by default) or xml_parse/fopen (such as in this example http://www.ustrem.org/en/articles/large-xml-files-in-php-en/ ). I can’t seem to figure out how to handle these nested elements easily, though. For example, since there are a number of nodes called “name” on different levels, I often get too many results or values from the wrong tags.

Does anyone have any suggestions on how to handle this parsing?

AnthonySterling · December 27, 2010, 11:01am

This thread might be of interest. I never did get round to cleaning the code up, but the mechanics are there.

MainArea · December 28, 2010, 12:45am

Thanks, Anthony - that code is helpful. The main elements within the <album> tag get extracted into the array, but the nested elements are where I run into trouble. For example, an array element “artist” is created but is blank.

I’m going to try to play with the code to step through the nested elements as well, but unfortunately, that’s where I’ve been having issues.

Thanks again for the help.

MainArea · December 28, 2010, 1:41am

Actually, the code isn’t working correctly. The code is setting the array value to the last instance found within an XML element.

For example, if <album> has an element named URL, and <data> (a nested element underneath <album> and the last element found within the album element - see pastebin link above) also has an element named URL, the array will grab the value of URL within <data> rather than the value of URL within <album>.

Time to play more with this…
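One way to avoid the overwrite described above is to key each value by its full element path rather than by its tag name alone. The following is only a minimal sketch under assumptions: the sample XML is a made-up stand-in for the real feed, and `collect_by_path` is a hypothetical helper name, not code from this thread.

```php
<?php
// Hypothetical sample shaped like the structure described in the thread.
$sample = <<<XML
<albums>
  <album>
    <id>1</id>
    <url>http://example.com/album</url>
    <data>
      <id>10</id>
      <url>http://example.com/data</url>
    </data>
  </album>
</albums>
XML;

// Hypothetical helper: returns text values keyed by their element path,
// so album/url and album/data/url no longer collide.
function collect_by_path($xmlString){
  $reader = new XMLReader;
  $reader->XML($xmlString);

  $path   = array(); // stack of currently open element names
  $values = array();

  while($reader->read()){
    if($reader->nodeType === XMLReader::ELEMENT){
      $path[] = $reader->name;
      // Self-closing elements never produce an END_ELEMENT node.
      if($reader->isEmptyElement){
        array_pop($path);
      }
    }elseif($reader->nodeType === XMLReader::END_ELEMENT){
      array_pop($path);
    }elseif($reader->nodeType === XMLReader::TEXT){
      $values[implode('/', $path)] = $reader->value;
    }
  }

  $reader->close();
  return $values;
}

$values = collect_by_path($sample);
echo $values['albums/album/url'], "\n";
echo $values['albums/album/data/url'], "\n";
```

With more than one album in the input, this sketch would keep only the last value per path; it is meant to show the disambiguation, not to replace per-album handling.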

team1504 · December 28, 2010, 2:43am

That looks and sounds like a large and complex XML file. Reading through it, I saw that it describes albums, so I can understand the reason behind its complexity.
Props to you for writing it, and good luck managing the XML data.

MainArea · December 28, 2010, 5:25am

So, I was able to write some pretty awful code to get the elements - way too sloppy/awful to post just yet. I may eventually post it after cleaning it up a little - it’s super slow, but speed isn’t a concern for me. Because of the tweaks I made, it also somehow ends up loading the whole file into RAM rather than streaming it. Thankfully, I’ve got a Mac Pro w/5GB of RAM, so it actually works with RAM + swap space.

Unfortunately, after running through 270,000 records, I’m now getting this error:

PHP Warning: XMLReader::read(): An Error Occured while reading in /Volumes/Media/music/phpfiles/parse-albums.php on line 19
PHP Warning: XMLReader::read(): An Error Occured while reading in /Volumes/Media/music/phpfiles/parse-albums.php on line 15

From my quick searching, it looks like there’s probably an invalid character or something there. I don’t control the data source, so this could turn into a mess attempting to track down every time this happens until it finally gets through all 5GB of data…
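To narrow down where the reader gives up, libxml's own error list can be inspected, since it records a line and column for each parse error. This is a rough sketch, assuming the failure really is a parse error such as an invalid character; `find_first_xml_error` is a hypothetical helper name, and it takes a string only to stay self-contained (for a real file, `$reader->open($file)` would take the place of `$reader->XML(...)`).

```php
<?php
// Hypothetical helper: returns the first libxml parse error as an array,
// or null if the whole input was read without errors.
function find_first_xml_error($xmlString){
  // Collect libxml errors instead of emitting them as warnings.
  $previous = libxml_use_internal_errors(true);
  libxml_clear_errors();

  $reader = new XMLReader;
  $reader->XML($xmlString);

  // read() returns false at the end of input or on a parse error;
  // @ silences the generic "error while reading" warning.
  while(@$reader->read()){
    // nothing to do, we only want to reach the failure point
  }

  $errors = libxml_get_errors();
  libxml_clear_errors();
  libxml_use_internal_errors($previous);
  $reader->close();

  if(!$errors){
    return null;
  }

  $first = $errors[0];
  return array(
    'line'    => $first->line,
    'column'  => $first->column,
    'message' => trim($first->message),
  );
}

// Made-up input with a control character that XML 1.0 does not allow.
$broken = "<albums>\n<album><name>Good</name></album>\n<album><name>Bad\x01</name></album>\n</albums>";

$error = find_first_xml_error($broken);
if($error !== null){
  printf("line %d, column %d: %s\n", $error['line'], $error['column'], $error['message']);
}
```

Knowing the line number makes it possible to look at the offending bytes directly instead of guessing from the record count.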

AnthonySterling · December 28, 2010, 12:43pm

OK, here’s what I have so far.


<?php
error_reporting(-1);
ini_set('display_errors', true);

function get_reader($file){
  $reader = new XMLReader;
  $reader->open($file);
  return $reader;
}

function handle_album(Album $album){
  /*
    This gets called every time an album node
    has been iterated.
  */
}

class Album
{
  public
    $artist,
    $label,
    $category,
    $data;

  public function __construct(){
    $this->artist   = new stdClass;
    $this->label    = new stdClass;
    $this->category = new stdClass;
    $this->data     = array();
  }
}

$xml = get_reader('php/xml.xml');

while($xml->read()){

  $isNewAlbum = 'album' === $xml->name && $xml->nodeType === XMLReader::ELEMENT;

  if($isNewAlbum){

    $album = new Album;
    
    #TODO

  }

  $isCompleteAlbum = 'album' === $xml->name && $xml->nodeType === XMLReader::END_ELEMENT;

  if($isCompleteAlbum){
    handle_album($album);
  }

}
?>

I’ll do a little more after lunch.

AnthonySterling · December 28, 2010, 1:36pm

I figured out a better, cleaner way; although it does still need some tweaking. You shouldn’t have any issues with memory either (bonus!), let me know if you have any questions.


<?php
error_reporting(-1);
ini_set('display_errors', true);

function get_reader($file){
  $reader = new XMLReader;
  $reader->open($file);
  return $reader;
}

function handle_album(SimpleXMLElement $album){
  /*
    This gets called every time an album node
    has been iterated.
  */
  printf(
    "(%d) %s - %s\n",
    $album->id,
    $album->name,
    $album->url
  );
}

$xml = get_reader('php/xml.xml');

while($xml->read()){
  $isNewAlbum = 'album' === $xml->name && $xml->nodeType === XMLReader::ELEMENT;
  if($isNewAlbum){
    $doc = new DOMDocument('1.0', 'UTF-8');
    handle_album(
      simplexml_import_dom($doc->importNode($xml->expand(), true))
    );
  }
}

/*
  (123456) Name1 - http://www.site.com/url1
  (6665) Name2 - http://www.site.com/2url1
*/

Good luck!

Anthony.

MainArea · December 29, 2010, 7:13am

Anthony,

That code is amazing - it uses very little RAM and works very well. Because there are multiple <data> nodes for each <album>, I used the following code in the handle_album function to iterate through each <data> child contained within <album>:


$album_id = $album->id;

foreach ($album->data as $child){
	//print_r($child);
	printf("\"%s\",\"%s\",\"%s\",\"%s\"\n",
		$child->id,
		$child->name,
		$child->sample_url,
		$album_id
		);
}
//Prints "DataID","DataName","DataSampleURL","AlbumID"

Thanks again for the huge help!
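If the quoted output above is meant to be loaded as CSV, a value that itself contains a double quote or a comma would break the hand-written quoting. A possible alternative, sketched here with PHP's built-in `fputcsv` swapped in for `printf`, lets PHP handle the escaping. The helper names `album_data_rows` and `rows_to_csv` and the sample album are assumptions made up for illustration.

```php
<?php
// Hypothetical helper: one row per <data> child, in the same column order
// as the printf version (DataID, DataName, DataSampleURL, AlbumID).
function album_data_rows(SimpleXMLElement $album){
  $rows = array();
  foreach($album->data as $child){
    $rows[] = array(
      (string) $child->id,
      (string) $child->name,
      (string) $child->sample_url,
      (string) $album->id,
    );
  }
  return $rows;
}

// Hypothetical helper: lets fputcsv() do the quoting and escaping.
function rows_to_csv(array $rows){
  $out = fopen('php://memory', 'w+');
  foreach($rows as $row){
    fputcsv($out, $row, ',', '"', '\\');
  }
  rewind($out);
  $csv = stream_get_contents($out);
  fclose($out);
  return $csv;
}

// Made-up album whose data name contains double quotes.
$album = simplexml_load_string(
  '<album><id>1</id>'
  . '<data><id>10</id><name>Say "Hi"</name><sample_url>http://example.com/s.mp3</sample_url></data>'
  . '<data><id>11</id><name>Plain</name><sample_url>http://example.com/t.mp3</sample_url></data>'
  . '</album>'
);

echo rows_to_csv(album_data_rows($album));
```

Inside `handle_album`, the same rows could be written to an open file handle, or passed to a database insert, instead of being collected in memory.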

AnthonySterling · December 29, 2010, 11:26am

You’re welcome.

The iteration looks good, I’m glad you’ve sorted it.

See you around,

Anthony.

salathe · December 29, 2010, 2:28pm

AnthonySterling:


$xml = get_reader('php/xml.xml');

while($xml->read()){
  $isNewAlbum = 'album' === $xml->name && $xml->nodeType === XMLReader::ELEMENT;
  if($isNewAlbum){
    $doc = new DOMDocument('1.0', 'UTF-8');
    handle_album(
      simplexml_import_dom($doc->importNode($xml->expand(), true))
    );
  }
}

This part of the script could be tidied up a bit by not creating a new DOMDocument for each album. XMLReader can expand a node into an existing DOMDocument context (this is undocumented, but available as of PHP 5.3.0), which would look like:


$xml = get_reader('php/xml.xml');
$doc = new DOMDocument; // Added
while($xml->read()){
  $isNewAlbum = 'album' === $xml->name && $xml->nodeType === XMLReader::ELEMENT;
  if($isNewAlbum){
    handle_album(
      simplexml_import_dom($xml->expand($doc)) // Tidied
    );
  }
}
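For anyone wanting to try the reused-DOMDocument variant without the full file, here is a small self-contained sketch. It assumes a made-up two-album sample, and `album_names` is a hypothetical function name; the nested artist <name> is there to show that `$album->name` only picks up the album's direct child.

```php
<?php
// Made-up sample: each album has its own <name> plus a nested artist <name>.
$sample = <<<XML
<albums>
  <album>
    <id>1</id>
    <name>First</name>
    <artist><id>7</id><name>Artist A</name></artist>
  </album>
  <album>
    <id>2</id>
    <name>Second</name>
    <artist><id>8</id><name>Artist B</name></artist>
  </album>
</albums>
XML;

// Hypothetical helper: streams through the input and returns album names.
function album_names($xmlString){
  $reader = new XMLReader;
  $reader->XML($xmlString);

  $doc   = new DOMDocument; // created once and reused for every album
  $names = array();

  while($reader->read()){
    $isNewAlbum = 'album' === $reader->name && $reader->nodeType === XMLReader::ELEMENT;
    if($isNewAlbum){
      // Expand only the current album subtree into the shared document.
      $album   = simplexml_import_dom($reader->expand($doc));
      $names[] = (string) $album->name;
    }
  }

  $reader->close();
  return $names;
}

print_r(album_names($sample));
```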

AnthonySterling · December 29, 2010, 2:33pm

Brilliant, that’s good to know. Thanks for the feedback, Salathe.