Handling massive dataset, memory optimization

data

#1

I am working on a custom WP plugin which in short simply loads products from a manufacturer’s website API and insert those products (with the needed changes) into WP and WooCommerce.

The issue I am facing is that on the staging server (with identical specifications to the production server), the plugin runs out of memory while receiving the details from the API.

Unfortunately this API can’t give paged results, and simply gives a dump (containing about 9.3MB of json data; which equates to over 8500 product variations; the 9.3mb is a dump file created of the raw returned data). During the receiving of this data, the PHP memory limit (of 128mb) is reached - I might not be able to increase this limit (busy trying but as an alternative).

The API is a SOAP type, and trying to use less memory I tried not declaring a variable first before sending it to the further processing function (though further reading concluded that this does not use more or less memory).

The failing part looks like this:

$this->translate_data( 
    $this->SOAPClient->GetFeedByEntityID( $params )->GetFeedByEntityIDResult->ResponseData );

The “->GetFeedByEntityIDResult->ResponseData” is to get to the response data as soon as possible without declaring more variables.

The function in which this is located terminates after dumping the translated (further processed data) to a file (where a 3rd part of the script handles it as the actual import to WC - this 3rd part is modular to also handle data from other manufacturer APIs; the translate_data-method simply gets it into a usable format).

Should I perhaps try to handle the translate_data method inside the current method (to unset the data asap)? What would you suggest?

I have very limited experience with this size datasets, especially ones that can’t be paged.


#2

A 9.5mb xml file should not be an issue, the issue will likely be the way the XML is generated. If you are running out of memory it probably means that all the records are being loaded at once.

Roughly, here’s what to do to fix it:

  • Create the XML file with any relevant metadata tags, write it to disk
  • Query N records
  • Generate the XML and append it to the file
  • Repeat for remaining records
  • Write any closing tags and serve the file to the browser
  • (Probably cache the file for N minutes and regenerate it only after that time has elapsed)

#3

Thank you for the quick response.

That is where the issue comes in. The API is on a remote server and on the “GetFeedByEntityID()” method, it outputs the data as a single JSON response. As far as I can see, the amount of memory used to read this response is what is killing it (having dumped the response to a text file directly after receiving it).

I can’t choose to only receive the first 5000 items in the object, and later do the next 5000.

While I can set the output to xml, it will still provide me with the same data, just in xml - I decided JSON should be easier to handle, and this does not yet decode the JSON data, this is only reading it into a string.


#4

Ah sorry, I thought you were trying to generate the XML file, not parse it.

There are a few options. You could write a custom JSON parser using fread to read sections of the document at a time (e.g. find the start of one record and keep reading until then end, process the data, then read the next part of the file).

A simpler approach might be to chunk the file yourself. A similar problem to above, but rather than writing a full on parser, if you detect where an individual record ends scan the file for say 50 records, then write them to their own .json file and continue until you reach the end.

Alternatively, increasing PHP’s memory limit will probably be the easiest approch.


#5

Is it possible to use Curl to download the file?


#6

Have you tried a streaming JSON parser? Instead of decoding the entire file at once it will decode it chunk by chunk, saving a lot of memory.

See for example https://github.com/salsify/jsonstreamingparser/blob/master/README.md

It requires a file but you could use an in memory writer as well by using something like

$fp = fopen('php://memory', 'r+');
fputs($fp, $jsonData);
rewind($fp);

// now pass $fp to the decoder

#7

I have now contacted the hosting company to hopefully get more memory.

While I would like to be able to handle this, it seems to be impossible due to:

  • I don’t have access to the request resource (like I would have if it was a simple REST API, compared to the SOAP API) to simply read it in section by section (or even using the steaming JSON parsing), and I have no idea how to convert a SOAPClient method to a curl request.
  • The script reaches the memory limit before giving the received string to my own code, so this is before I even start decoding the data.

I therefore concluded that I can only try to reduce memory usage by other objects, variables etc, which is very limited (since my plugin needs to run after WC have initialized, to get access to WC functions, Classes etc, by that time other plugins have also initialized). I developed this plugin on a local Vagrant VVV and it works on that setup since it is a plain WP install with WC and my plugin (though even then I can see I am close to the limit at that time - after that, memory usage drops quite far, since I could optimize it, destroying any variables I no longer need and running through the JSON array using array_pop - yes I process it in reverse, but the resulting file is also processed in reverse and this allows for a few needed calculations which would be hard running forwards).

Thank you all for the suggestions.


#8

This is pretty simple:

  1. Download the file with CURL (
    https://stackoverflow.com/questions/6409462/downloading-a-large-file-using-curl )
  2. Have your script open the downloaded local file rather than a remote URL.

#9

You can’t just curl a SOAP request as SOAP is a very specific XML DPC protocol, so while you can download the XML you’d still need to parse that.

Honestly, the only viable options I see is to either ask the supplier to implement paging, or to get the memory limit increased. Anything else would be accidental complexity.


#10

I recently had some file size problems trying to use large CSV format census data files in QGIS. Some worked by opening them in LibreOffice, removing unused columns, and then saving and using the resultant smaller file. For others I needed to open them in Notepad++ and save them in smaller files segmented by rows, open in LibreOffice, remove unused columns, save, and reassemble into a single file.

The point is. if the data is too much for what you want to use it with directly, try to find something intermediary that can deal with it and reduce it as much as you can.

Other than that, avoid globals whenever possible and try to force Garbage Collection as soon as you’re done with something. eg. unset($large_array) or $large_array = null


#11

SOAP uses HTTP. You can create the HTTP request with CURL (might be a bit more work as you’d need to generate the XML request) and save the file.

Once you’ve got the file you can slice it up. Use a strem to open it, search through and find 100x </m:closingTag> and save them to their own file. Repeat until you run out of records.

It’s a bit of a pain but doable. You could do something similar with HTTP byte ranges if the server supported it (though you’d have to keep sending the request).


#12

Try setting error_reporting(-1); and ini_set(‘display_errors’, ‘1’); before calling the problematic script and make a note of the type of memory which is being exhausted. Once you know the memory type, check the online PHP user manual and see if it is possible to increase the memory at run time using ini_set('memory-type, value);


#13

Like I said: accidental complexity :wink: