  1. #1
    SitePoint Addict
    Join Date
    Feb 2001
    Location
    -
    Posts
    389
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    Parsing a large (1+GB) nested XML file

    I have a 1+GB XML file that I need to parse into a database. The XML file has many nested elements - I've posted a sample of the XML file at http://pastebin.com/7Wzzaxg1 (but this XML file continues for hundreds of thousands of rows).

    What I need to do is to export all of this data from the XML file into a database. I'm trying to figure out how to process segments to export - for example, on one run-through, I'll want to extract the child elements of Album (id, name, sample url, upc, artist ID, label id, category id). On another run-through, I'll want to grab all Artist data (id, name, url). On yet another run-through, I'll need the "data" element in addition to the album ID to which it belongs (Album ID, Data ID, Data Name, Data Sample URL).

    Unfortunately, since this file is so huge, I'm unable to use SimpleXML parsing - I'm forced to use XMLReader (which streams an input file by default) or xml_parse/fopen (such as in this example http://www.ustrem.org/en/articles/la...les-in-php-en/ ). I can't seem to figure out how to handle these nested elements easily, though. For example, since there are a number of nodes called "name" on different levels, I often get too many results or values from the wrong tags.

    Does anyone have any suggestions on how to handle this parsing?
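
One way to tell same-named nodes apart while streaming is to track the path of open elements. The following is only a minimal sketch: the sample XML is hypothetical (the real structure is known only from the description above), and the `$path` and `$found` variables are illustrative names. It keeps a stack of element names and keys each text value by its full path, so `name` under `album` and `name` under `artist` no longer collide.

```php
<?php
// Hypothetical sample; element names are assumed from the thread's description.
$xmlString = <<<XML
<albums>
  <album>
    <id>1</id>
    <name>Album One</name>
    <artist><id>10</id><name>Artist A</name></artist>
    <data><id>100</id><name>Track X</name></data>
  </album>
</albums>
XML;

$reader = new XMLReader();
$reader->XML($xmlString);

$path  = [];  // stack of currently open element names, root first
$found = [];  // slash-joined path => list of text values (fine for a demo, not for 1+GB)

while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT) {
        $path[] = $reader->name;
        if ($reader->isEmptyElement) {
            array_pop($path); // <foo/> never produces an END_ELEMENT
        }
    } elseif ($reader->nodeType === XMLReader::END_ELEMENT) {
        array_pop($path);
    } elseif ($reader->nodeType === XMLReader::TEXT) {
        $found[implode('/', $path)][] = $reader->value;
    }
}
$reader->close();

echo $found['albums/album/name'][0], "\n";        // Album One
echo $found['albums/album/artist/name'][0], "\n"; // Artist A
echo $found['albums/album/data/name'][0], "\n";   // Track X
```

For a file of the size described, the values would be written out as each path of interest is seen rather than collected in an array.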

  2. #2
    AnthonySterling
    Join Date
    Apr 2008
    Location
    North-East, UK.
    Posts
    6,111
    Mentioned
    3 Post(s)
    Tagged
    0 Thread(s)
    This thread might be of interest. I never did get round to cleaning the code up, but the mechanics are there.
    @AnthonySterling: I'm a PHP developer, a consultant for oopnorth.com and the organiser of @phpne, a PHP User Group covering the North-East of England.

  3. #3
    SitePoint Addict
    Thanks, Anthony - that code is helpful. The main elements within the <album> tag get extracted into the array, but the nested elements are where I run into trouble. For example, an array element "artist" is created but is blank.

    I'm going to try to play with the code to step through the nested elements as well, but unfortunately, that's where I've been having issues.

    Thanks again for the help.

  4. #4
    SitePoint Addict
    Actually, the code isn't working correctly. The code is setting the array value to the last instance found within an XML element.

    For example, if <album> has an element named URL, and <data> (a nested element underneath <album> and the last element found within the album element - see pastebin link above) also has an element named URL, the array will grab the value of URL within <data> rather than the value of URL within <album>.

    Time to play more with this...
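
The collision described above comes from filling a flat array keyed by tag name. As a small illustration, assuming a single <album> subtree is available as a SimpleXMLElement (the fragment and URLs below are made up), property access only looks at direct children, so the album's `url` and the nested data `url` stay separate:

```php
<?php
// Hypothetical album fragment: both <album> and its nested <data> have a <url> child.
$album = simplexml_load_string(
    '<album>'
    . '<id>1</id>'
    . '<url>http://example.com/album</url>'
    . '<data><id>100</id><url>http://example.com/track</url></data>'
    . '</album>'
);

// $album->url matches only direct children of <album>, not descendants.
$albumUrl = (string) $album->url;
$dataUrl  = (string) $album->data->url;

echo $albumUrl, "\n"; // http://example.com/album
echo $dataUrl, "\n";  // http://example.com/track
```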

  5. #5
    team1504 (SitePoint Guru)
    Join Date
    May 2010
    Location
    Okemos, Michigan, USA
    Posts
    732
    Mentioned
    1 Post(s)
    Tagged
    0 Thread(s)
    That looks and sounds like a large and complex XML file. On reading it I saw that it describes albums, and I can understand the reason behind its complexity.
    Props to you for writing it, and good luck managing the XML data.

  6. #6
    SitePoint Addict
    So, I was able to write some pretty awful code to get the elements - way too sloppy/awful to post just yet. I may eventually post it after cleaning a little bit - it's super slow, but speed isn't a concern for me. It also ends up somehow loading the whole file in RAM rather than streaming it based on the tweaks that I made. Thankfully, I've got a Mac Pro w/5GB of RAM, so it actually works with RAM + swap space.

    Unfortunately, after running through 270,000 records, I'm now getting this error:

    PHP Warning: XMLReader::read(): An Error Occured while reading in /Volumes/Media/music/phpfiles/parse-albums.php on line 19
    PHP Warning: XMLReader::read(): An Error Occured while reading in /Volumes/Media/music/phpfiles/parse-albums.php on line 15

    From my quick searching, it looks like there's probably an invalid character or something there. I don't control the data source, so this could turn into a mess attempting to track down every time this happens until it finally gets through all 5GB of data...
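
To narrow down where the input breaks, libxml's error buffer can report a line and column for each parse error. This is a rough sketch under assumptions: the broken sample string is invented, and the actual cause in the real file is unknown. It uses `libxml_use_internal_errors()` and `libxml_get_errors()`, which are standard PHP functions.

```php
<?php
// Ask libxml to collect its parse errors instead of reporting each one as a PHP warning.
// PHP may still emit its own generic warning from read(); the collected entries carry the detail.
libxml_use_internal_errors(true);

// Hypothetical broken input: the <name> on the third line is never closed.
$xmlString = "<albums>\n<album><name>ok</name></album>\n<album><name>bad</album>\n</albums>";

$reader = new XMLReader();
$reader->XML($xmlString);

// Read until the end of the document or until the parser gives up.
while ($reader->read()) {
    // album handling would go here
}

// Each LibXMLError carries the position and message of one problem.
$errors = libxml_get_errors();
foreach ($errors as $error) {
    printf("line %d, column %d: %s\n", $error->line, $error->column, trim($error->message));
}
libxml_clear_errors();
$reader->close();
```

With the position known, the offending bytes can be inspected directly in the source file instead of guessing from the record count.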

  7. #7
    AnthonySterling
    OK, here's what I have so far.
    PHP Code:
    <?php
    error_reporting(-1);
    ini_set('display_errors', true);

    function get_reader($file){
      $reader = new XMLReader;
      $reader->open($file);
      return $reader;
    }

    function handle_album(Album $album){
      /*
        This gets called every time an album node
        has been iterated.
      */
    }

    class Album
    {
      public
        $artist,
        $label,
        $category,
        $data;

      public function __construct(){
        $this->artist   = new stdClass;
        $this->label    = new stdClass;
        $this->category = new stdClass;
        $this->data     = array();
      }
    }

    $xml = get_reader('php/xml.xml');

    while($xml->read()){

      $isNewAlbum = 'album' === $xml->name && $xml->nodeType === XMLReader::ELEMENT;

      if($isNewAlbum){

        $album = new Album;

        #TODO

      }

      $isCompleteAlbum = 'album' === $xml->name && $xml->nodeType === XMLReader::END_ELEMENT;

      if($isCompleteAlbum){
        handle_album($album);
      }

    }
    ?>
    I'll do a little more after lunch.

  8. #8
    AnthonySterling
    I figured out a better, cleaner way, although it does still need some tweaking. You shouldn't have any issues with memory either (bonus!). Let me know if you have any questions.

    PHP Code:
    <?php
    error_reporting(-1);
    ini_set('display_errors', true);

    function get_reader($file){
      $reader = new XMLReader;
      $reader->open($file);
      return $reader;
    }

    function handle_album(SimpleXMLElement $album){
      /*
        This gets called every time an album node
        has been iterated.
      */
      printf(
        "(%d) %s - %s\n",
        $album->id,
        $album->name,
        $album->url
      );
    }

    $xml = get_reader('php/xml.xml');

    while($xml->read()){
      $isNewAlbum = 'album' === $xml->name && $xml->nodeType === XMLReader::ELEMENT;
      if($isNewAlbum){
        // Expand only the current <album> subtree into a small DOM, then hand it to SimpleXML.
        $doc = new DOMDocument('1.0', 'UTF-8');
        handle_album(
          simplexml_import_dom($doc->importNode($xml->expand(), true))
        );
      }
    }

    /*
      (123456) Name1 - http://www.site.com/url1
      (6665) Name2 - http://www.site.com/2url1
    */
    Good luck!

    Anthony.

  9. #9
    SitePoint Addict
    Anthony,

    That code is amazing - uses very little RAM and works very well. For the <data> node, as there are multiple instances for each <album>, I used the following code in the handle_album function - this iterates through each <data> child contained within <album>
    PHP Code:
    $album_id = $album->id;

    foreach ($album->data as $child){
        //print_r($child);
        printf("\"%s\",\"%s\",\"%s\",\"%s\"\n",
            $child->id,
            $child->name,
            $child->sample_url,
            $album_id
        );
    }
    //Prints "DataID","DataName","DataSampleURL","AlbumID"
    Thanks again for the huge help!
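
If the CSV output is meant for a database import, hand-built quoting can break on names that contain quotes or commas. As a hedged alternative, the sketch below uses PHP's built-in `fputcsv()`; the album fragment, its values, and the `php://memory` target are assumptions for the demo, and a real run would write to a file handle instead.

```php
<?php
// Hypothetical album subtree with two <data> children; one name contains quotes and a comma.
$album = simplexml_load_string(
    '<album><id>7</id>'
    . '<data><id>1</id><name>Say "Hi", Again</name><sample_url>http://example.com/1</sample_url></data>'
    . '<data><id>2</id><name>Plain</name><sample_url>http://example.com/2</sample_url></data>'
    . '</album>'
);

// In-memory stream for the demo; swap in fopen('data.csv', 'w') for real output.
$out = fopen('php://memory', 'w+');

foreach ($album->data as $child) {
    // fputcsv() quotes a field only when needed and doubles embedded quotes.
    fputcsv($out, [
        (string) $child->id,
        (string) $child->name,
        (string) $child->sample_url,
        (string) $album->id,
    ], ',', '"', '\\');
}

rewind($out);
$csv = stream_get_contents($out);
fclose($out);

echo $csv;
```

Unlike the printf version, plain fields are left unquoted here, which most CSV importers accept.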

  10. #10
    AnthonySterling
    You're welcome.

    The iteration looks good, I'm glad you've sorted it.

    See you around,

    Anthony.

  11. #11
    Salathe
    Join Date
    Dec 2004
    Location
    Edinburgh
    Posts
    1,396
    Mentioned
    54 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by AnthonySterling View Post
    PHP Code:
    $xml = get_reader('php/xml.xml');

    while($xml->read()){
      $isNewAlbum = 'album' === $xml->name && $xml->nodeType === XMLReader::ELEMENT;
      if($isNewAlbum){
        $doc = new DOMDocument('1.0', 'UTF-8');
        handle_album(
          simplexml_import_dom($doc->importNode($xml->expand(), true))
        );
      }
    }

    This part of the script could be tidied up a bit by not creating a new DOMDocument for each album. XMLReader can expand a node into an existing DOMDocument context (this is undocumented, but available as of PHP 5.3.0), which would look like:

    PHP Code:
    $xml = get_reader('php/xml.xml');
    $doc = new DOMDocument; // Added
    while($xml->read()){
      $isNewAlbum = 'album' === $xml->name && $xml->nodeType === XMLReader::ELEMENT;
      if($isNewAlbum){
        handle_album(
          simplexml_import_dom($xml->expand($doc)) // Tidied
        );
      }
    }

    Salathe
    Software Developer and PHP Manual Author.
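
A self-contained variant of the same idea, for trying it without the large file. The two-album string is a hypothetical stand-in, and the use of `next('album')` is an addition not taken from the posts above: after an album has been expanded, it moves straight to the following <album> instead of reading back through the children that were just handled.

```php
<?php
// Hypothetical two-album document standing in for the large file.
$xmlString = '<albums>'
    . '<album><id>1</id><name>First</name><data><name>Track</name></data></album>'
    . '<album><id>2</id><name>Second</name></album>'
    . '</albums>';

$reader = new XMLReader();
$reader->XML($xmlString);
$doc = new DOMDocument(); // reused as the owner document for every expanded node

$names = [];

// Advance to the first <album> element.
while ($reader->read() && $reader->name !== 'album') {
    // keep reading
}

// Expand one album at a time, then skip its subtree with next('album').
while ($reader->name === 'album' && $reader->nodeType === XMLReader::ELEMENT) {
    $album = simplexml_import_dom($reader->expand($doc));
    $names[] = (string) $album->name; // direct child only, not the nested data name
    $reader->next('album');
}
$reader->close();

print_r($names);
```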

  12. #12
    AnthonySterling
    Brilliant, that's good to know. Thanks for the feedback Salathe.