SimpleXMLElement's capacity

Dear All

I would like to know whether there is any limit on the size of XML that SimpleXMLElement can handle effectively.

Can it handle an XML document of 50MB?
If not, which XML parser should I use?

Thanks in advance for any suggestions.

Great Anthony!!
That works :slight_smile:

But I have one problem. I have thousands of records, so populating them all as objects first and looping over them afterwards may take a long time.

It would be more efficient if we could operate on the nodes while reading them with XMLReader.

Any idea?

Really appreciate your help.

Thanks

Hi Anthony
I tried to use your code, but it’s showing duplicate empty values.
Any idea?

Ooo, this looks like fun. I’ll fire up the IDE.

Saying that, I reckon Salathe will beat me to it. :smiley:

The XML snippet looks OK to me

<?xml version="1.0" encoding="UTF-8"?>
<bronboek_basic_xml docformat="1.0">
<catalog>
    <product>
		<isbn>0.25</isbn>
		<auteur>KAART</auteur>
		<titel>DIV.  KAARTEN</titel>
		<levcode>SABRA</levcode>
		<editie></editie>
		<pagina></pagina>
		<nur>0</nur>
		<gewicht>0</gewicht>
		<prijs>0,25</prijs>
		<adatum>13-3-2003</adatum>
		<eenheid></eenheid>
		<catsoort>2</catsoort>
		<boeksoort>10</boeksoort>
		<berichtcode>0</berichtcode>
		<bindcode>0</bindcode>
		<btwcode>2</btwcode>
		<btwmode>1</btwmode>
	</product>
......................

True, 700 lines is more than a few, but with some determination, looking at them in a syntax-highlighting editor, you should spot the problem, e.g.

<?xml version="1.0" encoding="UTF-8"?>
<bronboek_basic_xml docformat="1.0">
<catalog>
    <product>
		<isbn>0.25</isbn>
		<auteur><![CDATA[KAART & KAART]]></auteur>
		<titel>DIV.  KAARTEN & KAARTEN</titel>
		<levcode>SABRA</levcode>
		<editie></editie>
......................

^ I’m wondering if something needs to be inside CDATA ??
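
If you control how the XML is generated, a bare `&` inside a text node is enough to make the whole document not well formed. Here is a minimal sketch of the two usual fixes, escaping or CDATA, using the `titel` value from the snippet above. It assumes PHP 5.4+ for `ENT_XML1`, and the variable names are only illustrative.

```php
<?php
$titel = 'DIV.  KAARTEN & KAARTEN';

// Option 1: escape the special characters, so "&" becomes "&amp;".
$escaped = '<titel>' . htmlspecialchars($titel, ENT_XML1 | ENT_QUOTES, 'UTF-8') . '</titel>';

// Option 2: wrap the raw text in CDATA. A literal "]]>" inside the text
// would end the section early, so it is split across two sections.
$cdata = '<titel><![CDATA[' . str_replace(']]>', ']]]]><![CDATA[>', $titel) . ']]></titel>';

// Both fragments should parse and give back the original text.
$results = array();
foreach (array($escaped, $cdata) as $fragment) {
    $node = simplexml_load_string($fragment);
    $results[] = ($node !== false && (string) $node === $titel);
    echo $fragment, "\n";
}
```

Either form is fine for a parser; escaping is usually the simpler choice when only a few characters are affected.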

I think I should go for XMLReader.
Can anybody help me read the following XML with XMLReader:

<?xml version="1.0" encoding="UTF-8"?>
<bronboek_basic_xml docformat="1.0">
<catalog>
    <product>
        <isbn>0.25</isbn>
        <auteur>KAART</auteur>
        <titel>DIV.  KAARTEN</titel>
        <levcode>SABRA</levcode>
        <editie></editie>
        <pagina></pagina>
        <nur>0</nur>
        <gewicht>0</gewicht>
        <prijs>0,25</prijs>
        <adatum>13-3-2003</adatum>
        <eenheid></eenheid>
        <catsoort>2</catsoort>
        <boeksoort>10</boeksoort>
        <berichtcode>0</berichtcode>
        <bindcode>0</bindcode>
        <btwcode>2</btwcode>
        <btwmode>1</btwmode>
    </product>
......................
</catalog>
</bronboek_basic_xml>

I want to read isbn, auteur, titel, etc. from each product node.

I found it a bit difficult with XMLReader.

I have tried the following approach:

$xml_reader = new XMLReader();
$xml_reader->XML($xml_string);
while($xml_reader->read()){
  
    if($xml_reader->name == "catalog" && $xml_reader->nodeType == XMLReader::ELEMENT){
       
        while($xml_reader->read()){
            echo $xml_reader->name . '<br />';
        }        
    }    

}

Can anybody suggest the proper way of getting product node values?

Thanks

AFAIK SimpleXML loads the entire XML into memory (tree rather than event).

So if your ini settings for memory are too low it won’t work, although I would guess you would get a memory error. But I don’t know, I never tried with a large XML file.

That error message suggests the XML isn’t well formed, but I suppose that could also happen if the script ran out of memory and the document was cut off partway through.

Can you throw more memory at it?

With a file that size, IMHO if you don’t need to work with the DOM a SAX parser would be better.
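
For reference, this is roughly what an event-based (SAX-style) parse looks like with PHP's built-in `xml_parser_*` functions. It is only a sketch: the inline sample stands in for the real file, and a real script would read the file in chunks with `fread()` and pass each chunk to `xml_parse()`.

```php
<?php
// Assumption: inline sample data instead of the real 50MB file.
$xml = '<catalog><product><isbn>0.25</isbn><auteur>KAART</auteur></product></catalog>';

$current  = null;     // name of the element whose text is being collected
$product  = array();  // fields of the product currently being built
$products = array();  // finished products (a real script would process, not store)

$parser = xml_parser_create('UTF-8');
// Keep element names as written; by default they are folded to upper case.
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

xml_set_element_handler(
    $parser,
    // Called for every opening tag.
    function ($parser, $name, $attrs) use (&$current, &$product) {
        if ($name === 'product') {
            $product = array();
        } elseif ($name !== 'catalog') {
            $current = $name;
            $product[$name] = '';
        }
    },
    // Called for every closing tag.
    function ($parser, $name) use (&$current, &$product, &$products) {
        if ($name === 'product') {
            $products[] = $product;
        }
        $current = null;
    }
);

xml_set_character_data_handler(
    $parser,
    function ($parser, $data) use (&$current, &$product) {
        if ($current !== null) {
            $product[$current] .= $data; // text can arrive in several pieces
        }
    }
);

// The third argument marks this as the final (here: only) chunk.
if (!xml_parse($parser, $xml, true)) {
    printf(
        "XML error: %s at line %d\n",
        xml_error_string(xml_get_error_code($parser)),
        xml_get_current_line_number($parser)
    );
}

print_r($products);
```

Memory use stays roughly flat because only the current product is held at any time, which is the main reason to prefer this over a tree parser for a file of that size.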

I didn’t get your point Anthony :slight_smile:

Can you suggest how to read the above XML using XMLReader?

Any help is much appreciated.

Thanks

EDIT:
Thanks for the code. I will try it and let you know.

Strange.

So far, I have…


<?php
error_reporting(-1);
ini_set('display_errors', true);

function load_xml($file){
  $reader = new XMLReader();
  $reader->open($file);
  return $reader;
}

$document = load_xml('products.xml');

while($document->read()){
  if('product' === $document->name && $document->nodeType === XMLReader::ELEMENT){
    while($document->read()){
      if('product' === $document->name && $document->nodeType === XMLReader::END_ELEMENT){
        break;
      }
      printf("%s = %s\n", $document->name, $document->value);
      /*
        isbn = 
        #text = 0.25
        isbn = 
        #text = 
        
        auteur = 
        #text = KAART
        auteur = 
        #text = 
      */
    }
  }
}

?>

As you can see, it’s giving me repeating elements, and I cannot see why. I’m going to grab a coffee and come back to it in 5 minutes.
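
A probable explanation for the repeats: `read()` stops on every node, not only on opening tags. Each field produces an ELEMENT node, a TEXT node, an END_ELEMENT node (same name, empty value), and then a whitespace node for the indentation before the next tag. Printing the node type makes this visible. This is a small sketch with an inline sample standing in for products.xml.

```php
<?php
// Assumption: a tiny inline sample instead of the real products.xml.
$xml = <<<XML
<catalog>
    <product>
        <isbn>0.25</isbn>
        <editie></editie>
    </product>
</catalog>
XML;

// Readable names for the node types this sample can produce.
$names = array(
    XMLReader::ELEMENT                => 'ELEMENT',
    XMLReader::TEXT                   => 'TEXT',
    XMLReader::WHITESPACE             => 'WHITESPACE',
    XMLReader::SIGNIFICANT_WHITESPACE => 'SIGNIFICANT_WHITESPACE',
    XMLReader::END_ELEMENT            => 'END_ELEMENT',
);

$reader = new XMLReader();
$reader->XML($xml);

$seen = array();
while ($reader->read()) {
    $type = isset($names[$reader->nodeType])
        ? $names[$reader->nodeType]
        : (string) $reader->nodeType;
    $seen[] = $type . ':' . $reader->name;
    // json_encode() makes whitespace-only values visible.
    printf("%-22s %-8s %s\n", $type, $reader->name, json_encode($reader->value));
}
$reader->close();
```

So the fix is to switch on `nodeType` and only act on ELEMENT and TEXT nodes, ignoring END_ELEMENT and whitespace.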

Nope. It supports namespaces just fine, at least the reading of them anyway.

Here is the sample:

<?xml version="1.0" encoding="UTF-8"?>
<bronboek_basic_xml docformat="1.0">
<catalog>
    <product><isbn>0.25</isbn><auteur>KAART</auteur><titel>DIV.  KAARTEN</titel><levcode>SABRA</levcode><editie></editie><pagina></pagina><nur>0</nur><gewicht>0</gewicht><prijs>0,25</prijs><adatum>13-3-2003</adatum><eenheid></eenheid><catsoort>2</catsoort><boeksoort>10</boeksoort><berichtcode>0</berichtcode><bindcode>0</bindcode><btwcode>2</btwcode><btwmode>1</btwmode></product>
......................

Thanks

Can you post or attach a small portion of the XML file?

I’m too tired to remember. Doesn’t SimpleXML choke on namespaces?

More research:
I used the following code:

libxml_use_internal_errors(true);
$library = simplexml_load_string($large_xml_string);
// Compare against false: an empty but valid element also evaluates as falsy.
if (false === $library) {
    echo "Failed loading XML<br />";
    foreach (libxml_get_errors() as $error) {
        echo $error->message . '<br />';
    }
}

And got the following errors:

Failed loading XML
StartTag: invalid element name
StartTag: invalid element name
error parsing attribute name
attributes construct error
Couldn’t find end of Start Tag Eagle line 700
Input is not proper UTF-8, indicate encoding ! Bytes: 0x89 0x6E 0x73 0x3C

Hope this helps you work out what is going on.

Thanks all for the great responses.

I also tried adding the following code at the top:
set_time_limit(0);
ini_set('memory_limit', '555555M');

but still the same error.

Maybe this is due to a limitation in the SimpleXML parsing model.
Maybe I should take a look at XMLReader.

Thanks

On that note, there’s always XMLReader.

Going by the code snippet in post #10, each <product> is on a single line.

Hence

Couldn’t find end of Start Tag Eagle line 700

And the fact that XMLReader is choking at 695 suggests that you still have an XML error in that area of the file.

Double check that area again or post it here if you can’t see anything obviously wrong with it.
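
If the lines are too long to eyeball, something like the following can pull out just that range and flag lines that are not valid UTF-8 (the "Input is not proper UTF-8" error points that way, since 0x89 followed by 0x6E is not a valid UTF-8 sequence). This is a sketch only: `lines_between()` is a made-up helper, and the temp file with three lines stands in for the real file and the 690 to 700 range.

```php
<?php
// Hypothetical helper: returns lines $from..$to (1-based) keyed by line number.
function lines_between($path, $from, $to)
{
    $out = array();
    $handle = fopen($path, 'r');
    if ($handle === false) {
        return $out;
    }
    for ($n = 1; ($line = fgets($handle)) !== false; $n++) {
        if ($n > $to) {
            break;
        }
        if ($n >= $from) {
            $out[$n] = rtrim($line, "\r\n");
        }
    }
    fclose($handle);
    return $out;
}

// Assumption: a three-line temp file standing in for the real XML.
// Line 2 contains the byte sequence from the error message (0x89 0x6E 0x73 0x3C).
$path = tempnam(sys_get_temp_dir(), 'xml');
file_put_contents($path, "<a>ok</a>\n<b>\x89ns</b>\n<c>ok</c>\n");

$suspect = array();
foreach (lines_between($path, 1, 3) as $n => $line) {
    // With the /u modifier, preg_match() fails on invalid UTF-8 input.
    if (preg_match('//u', $line) !== 1) {
        $suspect[] = $n;
        printf("line %d is not valid UTF-8\n", $n);
    }
}
unlink($path);
```

If lines get flagged, the file is probably in a legacy single-byte encoding despite the `encoding="UTF-8"` declaration, and converting it (or correcting the declaration) would be the next thing to try.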

Thanks a lot AnthonySterling.
I got the XMLReader working.
But…

It was only able to read 695 products, though we have around 50,000.
What could be the cause? Is there a flag to set for large XML in XMLReader?

Thanks

It works, but I bloody hate it. :smiley:


<?php
error_reporting(-1);
ini_set('display_errors', true);

function load_xml($file){
  $reader = new XMLReader();
  $reader->open($file);
  return $reader;
}

$document = load_xml('products.xml');

$products = array();

while($document->read()){
  if('product' === $document->name && $document->nodeType === XMLReader::ELEMENT){
    $product = new stdClass;
    $property = null; // no field seen yet for this product
    while($document->read()){
      if('product' === $document->name && $document->nodeType === XMLReader::END_ELEMENT){
        array_push($products, $product);
        break;
      }
      switch($document->nodeType){
        case XMLReader::ELEMENT:
          $property = $document->name;
          $product->{$property} = '';
        break;
        case XMLReader::TEXT:
          if(null !== $property){
            $product->{$property} = $document->value;
            $property = null;
          }
        break;
      }
    }
  }
}

print_r(
  $products
);

/*
  Array
  (
      [0] => stdClass Object
          (
              [isbn] => 0.25
              [auteur] => KAART
              [titel] => DIV.  KAARTEN
              [levcode] => SABRA
              [editie] => 
              [pagina] => 
              [nur] => 0
              [gewicht] => 0
              [prijs] => 0,25
              [adatum] => 13-3-2003
              [eenheid] => 
              [catsoort] => 2
              [boeksoort] => 10
              [berichtcode] => 0
              [bindcode] => 0
              [btwcode] => 2
              [btwmode] => 1
          )
      [1] => stdClass Object
          (
              [isbn] => 0.25
              [auteur] => KAART
              [titel] => DIV.  KAARTEN
              [levcode] => SABRA
              [editie] => 
              [pagina] => 
              [nur] => 0
              [gewicht] => 0
              [prijs] => 0,25
              [adatum] => 13-3-2003
              [eenheid] => 
              [catsoort] => 2
              [boeksoort] => 10
              [berichtcode] => 0
              [bindcode] => 0
              [btwcode] => 2
              [btwmode] => 1
          )
  )
*/
?>

Some of those errors are pretty clear. You could try and fix those.
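
To find them faster, each entry from `libxml_get_errors()` also carries a `line` and `column`, not just a message. A small sketch, with a deliberately broken inline sample (the unescaped `&` on line 3 is the defect):

```php
<?php
libxml_use_internal_errors(true);

// Assumption: a broken five-line sample instead of the real document.
$xml = "<catalog>\n<product>\n<titel>KAARTEN & KAARTEN</titel>\n</product>\n</catalog>";

$library = simplexml_load_string($xml);

$report = array();
if ($library === false) {
    foreach (libxml_get_errors() as $error) {
        $report[] = sprintf(
            'line %d, column %d: %s',
            $error->line,
            $error->column,
            trim($error->message)
        );
    }
    libxml_clear_errors(); // reset the buffer for the next parse
}

echo implode("\n", $report), "\n";
```

The first reported error is usually the real one; the later ones tend to be knock-on effects of the parser losing its place.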

I think we should operate on the following code:

if('product' === $document->name && $document->nodeType === XMLReader::END_ELEMENT){
        print_r($product);
        break;
      }

Am I right?
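
That is the right spot: the END_ELEMENT branch is where one product is complete. A minimal sketch of the same loop with the array dropped, so each product is handed to a callback and then discarded. `each_product()` is a made-up name, and the inline sample (including the second product) is invented test data, not from the real file.

```php
<?php
// Hypothetical helper: streams <product> elements and passes each one to
// $callback as soon as its closing tag is reached, so nothing accumulates.
function each_product(XMLReader $reader, $callback)
{
    $count = 0;
    while ($reader->read()) {
        if ($reader->nodeType !== XMLReader::ELEMENT || $reader->name !== 'product') {
            continue;
        }
        $product  = new stdClass;
        $property = null;
        while ($reader->read()) {
            if ($reader->nodeType === XMLReader::END_ELEMENT && $reader->name === 'product') {
                call_user_func($callback, $product); // product is complete here
                $count++;
                break;
            }
            if ($reader->nodeType === XMLReader::ELEMENT) {
                $property = $reader->name;
                $product->{$property} = '';
            } elseif ($reader->nodeType === XMLReader::TEXT && null !== $property) {
                $product->{$property} = $reader->value;
                $property = null;
            }
        }
    }
    return $count;
}

// Assumption: inline sample data instead of products.xml.
$xml = '<catalog>'
     . '<product><isbn>0.25</isbn><titel>DIV.  KAARTEN</titel><editie></editie></product>'
     . '<product><isbn>0.50</isbn><titel>ATLAS</titel><editie>2</editie></product>'
     . '</catalog>';

$reader = new XMLReader();
$reader->XML($xml);

$isbns = array();
$total = each_product($reader, function ($product) use (&$isbns) {
    $isbns[] = $product->isbn; // replace with a database insert, CSV write, etc.
});
$reader->close();

printf("%d products, ISBNs: %s\n", $total, implode(', ', $isbns));
```

With a real file you would call `$reader->open('products.xml')` instead of `XML()`, and memory use should stay flat however many products there are.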