Efficient way to pass large objects between scripts?

As a quick update, I then attempted 10,000, but reached my set memory limit in PHP, so I dropped it to 4,000 (the largest it would let me use before hitting my memory limit), and that took 13,074 ms, 61% of it in randomString (again on par with the other runs). However, read.php reaches the memory limit trying to read this file using file_get_contents(), so I might try using fopen() and reading smaller chunks to see if I can get around that. This may still be a problem, though, as I’m not sure how to tackle it: I can’t use json_decode on a partial json string…

cpradio Thank you for your time, I clearly need to run more tests.
I have tested with json/serialize/igbinary, and they do make a difference as you add more data/run them in parallel.

But you bring up a good point, maybe it’s slow because of the multi-dimensional arrays, I will have to verify that.

As a note on the memory, I have mine set to 124MB.
BUT since I get the data in chunks from the workers, I json_decode it (it spikes memory a lot, then releases it), and then only keep the values I really need.
I also re-arrange the data in a way so I never have anything duplicated, so I end up with needing about 70MB instead of a few GB RAM.

BUT, having said that, there are instances where the worker returns more data than I can receive (it needs too much memory to decode), which is another reason I wanted to find a fix for this.
(But that happens less than 0.0023% of the time, so about twice per search, once every 20 or so searches when I factor in caching, and when it happens, I just order by price and truncate the data a bit…)

aaarrrggh Most sites only give you one or two airlines, they pre-cache the prices (which change, but they update it on the client, or absorb the price change) and check availability as you select your price.
Or, they load their data from a data source like the one I’m working on (there are many many layers… with sometimes mainframes in the lower levels).

Okay, our memory limits are nearly identical, mine is set at 128M, so that is appropriate. As for your memory statement, can you elaborate how you decided you only need roughly 70MB?

One thing you can deduce from my read.php test is that since 98% of the processing time is in json_decode, the time you will see CPU and Memory spike is during that call. If you think about it, it really makes a lot of sense: you are loading all of the json objects into memory. If you are loading all of it, then wiping out portions because you don’t need them, you are still taking a hit up front because you don’t have the ability to filter that unnecessary data prior to calling json_decode.

At this point, I’d like to point out something that could be useful and that is designing your return data to mimic copybooks on the mainframe. A copybook is just a giant string. Each piece of data in the string is a fixed length located at a fixed position.

Example:

id   price    other
00001000560.34More Info

The header record could be optional, but for the remainder you could use fopen() and fread() to read one record at a time (since you know how long each record is). That way you load a single record into memory, grab the data you need and move on, filtering out the data you don’t need on a record-by-record basis.
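
Roughly, a sketch of that read loop, assuming the example layout above (id is 5 characters, price is 9, other is 9, each record ending with a newline) and a placeholder file name:

// Hypothetical fixed-width layout matching the example above: id (5) + price (9) + other (9) + "\n"
$recordLength = 5 + 9 + 9 + 1;

$handle = fopen('worker_results.dat', 'rb'); // placeholder file name

while (($record = fread($handle, $recordLength)) !== false && strlen($record) === $recordLength) {
    $id    = (int) substr($record, 0, 5);
    $price = (float) substr($record, 5, 9);
    $other = rtrim(substr($record, 14, 9));

    // Filter record by record instead of decoding everything first (made-up condition)
    if ($price > 1000) {
        continue;
    }

    // ...keep only the values needed for the final result set...
}

fclose($handle);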

Similarly, you could use a MongoDB collection, a SQLite table, or any other database for that matter to apply the same concept, and you gain a better pre-made framework for sorting purposes.

One of the biggest issues with json_decode and unserialize is that you have to load the entire dataset before you can process anything. That is harsh when the data is large and it can be costly (as you are finding). 900 processes each loading an entire dataset of 100+ objects (that may be 70M each) means you are trying to use 70M * 900 worth of RAM. If the data were in a table/copybook style, you could filter out the stuff you don’t need in your query and maybe end up using 30-35M each, cutting your memory usage in half.
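
For example, with SQLite via PDO (the table and column names here are invented, not the actual data model), each worker could insert its records and the parent could pull back only the columns and rows it actually needs:

// Sketch only: schema and filter values are made up for illustration
$db = new PDO('sqlite:/tmp/worker_results.sqlite');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Let the database do the filtering and sorting, so PHP never holds the full dataset
$stmt = $db->prepare('SELECT id, price FROM results WHERE worker_id = :worker AND price < :max ORDER BY price');
$stmt->execute(array(':worker' => 42, ':max' => 1000));

while ($row = $stmt->fetch(PDO::FETCH_ASSOC)) {
    // only one small row in memory at a time
}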

A couple years ago, I was tasked with processing 5-10 thousand policies (each about 1-2 MB in size) as quickly as possible; that meant loading them up, validating all of the data, ordering reports from third parties, processing business and validation rules against them, and finally providing a quote for each one. The process we built could run 1000-2500 policies an hour based on the thresholds we set. However, we only got to that speed because we profiled our code weekly, taking the highest-costing portion and reworking it until it fell into a group of 3 or more functions that took equal amounts of time. Doing this weekly allowed us to catch any code that seemed to take longer than the rest and bring it in line with the other methods; in the end no method was taking longer than the others, they all performed equally.

Now if all of your methods are performing badly, that isn’t much help, but chances are they aren’t, and a similar approach might be useful here. If you profile your code and see that the json calls are indeed where the most time is spent, then deviating from json is important, and I’d recommend going to either a flat fixed-width file (like a copybook on the mainframe) or to a database so you can retrieve what you need without loading all of the data into PHP’s memory.

The copybook format is a nightmare to maintain (and I actually have to parse something like that from 2 of the data sources).

The old system was passing data like this, but it’s a million and one times slower than the new one (bad code + slow parsing).

The 70MB is the result of the 900 workers (just the data I really need), and to verify it, I record PHP’s starting memory and print memory_get_peak_usage() and memory_get_usage() to get an idea (besides server stats and so on).

Also, one important note is that I don’t concatenate the 900 json replies; I parse them one by one, so I don’t have to hold EVERYTHING in memory, just the reply of one worker at a time.

For the database, I tried; I would have to make way too many inserts. Same approach with a memcached cluster: it ended up being the bottleneck (sending the data there just to send it back to whatever started the worker).

So now, I return the data directly to whatever started the worker, and the load balancing starts the parent on a server that can handle the bandwidth.

Yes, they are a pain to maintain (I can definitely agree with that), but slower? That seems a bit odd, as the filesize should definitely have been smaller than the json footprint… at least I know now that you were already down this path.

Ah, I misread that; the 70MB was the resulting memory footprint after getting rid of the data you didn’t need.

My point was if there was anything in the returned json data for each reply that you do not use, you are loading it into memory during json_decode and then ditching it, but maybe that isn’t what you are doing here, maybe you are filtering out whole json results from some of the 900 replies. Can you elaborate on that process?

Interesting again. It could have been useful to look into bulk inserting, but I’m willing to consider this as “tried and determined it was a bottleneck”.

This has been an amazing read, I’m impressed by the magnitude of processing in this system, it makes my head spin when I try to imagine that :smiley:

From my short test I found out that json_encode() is indeed faster than serialize(), but also that when the size of the array doubles, the time spent in json_encode() more than doubles, which suggests that the larger the input, the worse the performance becomes.
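
For reference, the kind of quick benchmark I mean (the array sizes and contents are arbitrary test data):

// Time json_encode() vs serialize() as the array doubles in size
foreach (array(1000, 2000, 4000, 8000) as $size) {
    $data = array();
    for ($i = 0; $i < $size; $i++) {
        $data[] = array('id' => $i, 'price' => $i * 1.5, 'other' => str_repeat('x', 32));
    }

    $start = microtime(true);
    json_encode($data);
    $jsonTime = microtime(true) - $start;

    $start = microtime(true);
    serialize($data);
    $serializeTime = microtime(true) - $start;

    printf("%d items: json_encode %.4fs, serialize %.4fs\n", $size, $jsonTime, $serializeTime);
}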

I don’t see how processing such large amounts of data can be done efficiently; the json parser has to do its work and there are no shortcuts. There are two solutions that come to my mind:

  1. Write a php extension in C that will do the serialization and unserialization. Your extension has the potential to be faster than the built-in functions because you most probably don’t need support for all datatypes and such, so the code could be simplified as much as possible to suit your needs.

  2. This is just speculation because I don’t have enough knowledge of PHP internals, but I think the best solution would be to get rid of the serializing/encoding/decoding steps altogether. If all the servers exchanging data run php scripts, then php must be storing those arrays in some internal (binary?) format in memory while the script is running, so if you were somehow able to grab the internal representation of the array and pass it directly to another server, which could inject it into its own running script, you would save a lot of unnecessary processing of the data back and forth. That would at least require a php extension, or even some digging into and patching of the php source and making your own custom php release - but if the project is of such a large scope then maybe it would be worth investigating such options.

Another aspect is that if you are processing all 900 responses, you may be holding on to memory allocations that are no longer necessary and could be cleared up (either through garbage collection, or by disposing of them within your application – set them to null or use unset).
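
For instance, something along these lines when walking the replies (a sketch only; $workerReplies and extractNeededValues() are made-up names):

$needed = array();

foreach ($workerReplies as $index => $reply) {
    $decoded = json_decode($reply, true);

    $needed[] = extractNeededValues($decoded); // hypothetical filtering step

    // Release the large structures before decoding the next reply
    unset($decoded);
    unset($workerReplies[$index]);
}

// Optionally force a collection cycle if circular references are involved
gc_collect_cycles();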

Another test that would be interesting is json_decode($data, true) versus json_decode($data, false). The latter returns an object, the former an associative array.
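
A quick way to check that on a representative payload (the file name is just a placeholder):

$json = file_get_contents('sample_reply.json'); // placeholder payload

$start = microtime(true);
json_decode($json, true);  // associative arrays
$arrayTime = microtime(true) - $start;

$start = microtime(true);
json_decode($json);        // stdClass objects (the default)
$objectTime = microtime(true) - $start;

printf("assoc array: %.4fs, objects: %.4fs\n", $arrayTime, $objectTime);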

Then, along the lines of what Lemon Juice was suggesting, you may want to consider HipHop for PHP, which takes your PHP code and transforms it into C++ (there is also HPHPi, which does all this at run-time, but to my knowledge that isn’t production-ready yet).

Lastly, is all of the processing happening through web protocols using php through apache, or via php cli? You may get some benefit by going to php cli, but I’m not certain on that.

Crap. I had some stuff come up. But what I was going to get to was: Are any of the returns from the remote sources reusable?

Is that format the only possible call you can make? From the speed you’re getting on the calls, it might be possible to start building up a better cache of data on a nightly basis. I’m working on a similar project using MongoDB, which allows storing multidimensional data (json) quickly and painlessly. It has much better select performance than relational dbs (mysql, MSSQL).

Very interesting indeed. I happen to work for a company that provides logistics software for about 40 airlines around the world, each of which has its own particular requirements. We are currently trying to move all our customers from the current front-ended mainframe system to a ‘modern’ internet-based system.

We’re using java and an Oracle db on the new system, which appears to cope well with this type of bottleneck. Most of the transactions are in real time - I’m not sure of the fine detail (I’m not a java programmer), but I know the system is coping very well with continuous complex requests, with an average of 800 users online at any one time, 24/7.

Not that I’m suggesting you should move to Java, it’s obviously too late for that. I’m simply empathizing with your dilemma. Airline requirements can be a pain, whatever area you’re working in.

Does Vali’s last post mean he has found a solution?

No, I think he was just letting me know why they chose json responses after considering some of the recommendations I made. json isn’t as heavy as XML, but it is still bulkier than plain text layouts, and I understand that finding the line between maintainability, usability, and speed is difficult. I think we’ve given several ideas here; the biggest is profiling the process to truly discover where the issue lies and whether it could be the result of other items (for my tests, generating the data was far slower than running json_encode). However, reading that json was definitely a spot for a bottleneck too.

The question becomes: which is taking longer on his system? Are there ways to speed up the generation of the json object (cut 2-3 seconds there and you are making good progress already!)? Tackling the reading will be far more difficult, as there isn’t much you can do to speed up the json parsing (unless using the optional true/false parameter makes any significant difference – I’m not sure it will).

Hey guys,
Thanks for all the suggestions.

No, I have not yet found a solution for this, but I have about 20 ideas that came from various replies to try out.
(that should take some time, but I will post what I find here, just to satisfy some curiosity)

For json_encode/json_decode vs serialize/unserialize:
json_encode is faster than serialize, and json_decode is slower than unserialize.
But, with my average data set, json encode+decode is faster than serialize+unserialize.

cpradio:

My point was if there was anything in the returned json data for each reply that you do not use, you are loading it into memory during json_decode and then ditching it, but maybe that isn’t what you are doing here, maybe you are filtering out whole json results from some of the 900 replies. Can you elaborate on that process?

I am loading all the data for each worker into memory, and then ditching what I don’t need, but since that part works sequentially on the master/parent/controller (whatever you want to call it) script, it only loads one response in memory at a time (not all 900); that is why I’m not hitting such a big memory wall.

As for the bulk inserts into a database/cache/whatever, that would mean I need a central system where I send the data and read it back from, and either way this data needs to be serialized/unserialized somehow (and when I had it like that, it maxed out the network).

As for hitting the cache more, the problem is that it requires more servers than I have.
There are two approaches here: fill up the cache (let’s say nightly) with all the possible searches (resulting in a faster first search, and more cache space needed) VS fill up the cache with the searches as they happen (resulting in a slower first search, throwing out old searches as I need more cache space).
When you calculate how many possible ways there are to get from point A to point B, you can’t really cache everything (last time I did that, I had a number with 36 digits for just the outbound possibilities, when using VIA points/different connection times).

I will get back to this thread with whatever solutions I end up testing and the results I get.

Is it not possible to pull all flights (30-50k a day by my calc) and calculate flight paths on your end?

EDIT: Never mind, that would be for US flights only.

More thoughts: What version of PHP are you running? Do you have garbage collection enabled? When you are manipulating the json / arrays, are you duplicating the arrays in ways that might cause some extra memory usage?


$bigArray = array(); // your return result from the remote server

$object2 = $bigArray; // full copy of a potentially large array

Obviously you’re not using arrays or just going to straight up copy an object like that, but you may be altering it in some way and moving that altered object to a new variable, if that makes sense.

That’s actually a really good question; at first I thought maybe arrays would be assigned by reference (they aren’t – unlike in some other languages).

$smallArray = array(1, 2, 3, 4, 5, 6, 7);
$newArray = $smallArray;
array_pop($smallArray);

var_dump($smallArray, $newArray);

Output

array(6) {
  [0]=>
  int(1)
  [1]=>
  int(2)
  [2]=>
  int(3)
  [3]=>
  int(4)
  [4]=>
  int(5)
  [5]=>
  int(6)
}
array(7) {
  [0]=>
  int(1)
  [1]=>
  int(2)
  [2]=>
  int(3)
  [3]=>
  int(4)
  [4]=>
  int(5)
  [5]=>
  int(6)
  [6]=>
  int(7)
}

However, forcing by reference

$smallArray = array(1, 2, 3, 4, 5, 6, 7);
$newArray = &$smallArray;
array_pop($smallArray);

var_dump($smallArray, $newArray);

Output

array(6) {
  [0]=>
  int(1)
  [1]=>
  int(2)
  [2]=>
  int(3)
  [3]=>
  int(4)
  [4]=>
  int(5)
  [5]=>
  int(6)
}
array(6) {
  [0]=>
  int(1)
  [1]=>
  int(2)
  [2]=>
  int(3)
  [3]=>
  int(4)
  [4]=>
  int(5)
  [5]=>
  int(6)
}

@Vali Two thoughts occur to me. First, K. Wolfe might be onto something here - significant gains in large-object manipulation efficiency have been realized in the PHP engine over the last several major releases. Second, if you aren’t averse to giving other languages a shot at this problem, the worker section of it might be more suited to nodejs than PHP, especially since you are building and exporting javascript objects, and node is server-side javascript built on the Chrome js engine.

I’m using PHP 5.4.8, and I know about their performance updates.

Michael Morris I was actually looking into node.js, but I was not sure how to link all the business rules to it… without remaking them in js that is…

There must be some sort of behavioral / popularity / business patterns you can find in the history of what people are looking for that could be used to dramatically reduce that 36-digit number of pre-cache possibilities. Maybe there are certain airlines or combinations that are more profitable than others that you can give priority to, maybe there are certain combinations that aren’t profitable enough to return a result for, and then you could just return some alternative service to those visitors.

Just a quick update for those interested.
I’m still looking for a better way to do this, but I found something:

igbinary_serialize / igbinary_unserialize

It can be used exactly like serialize/unserialize, and I’m using it instead of json_encode/json_decode.
It creates much smaller strings (faster network transfer), uses about half the RAM, and is twice as fast.

It’s a long way from the final solution, but for now it does the job.
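
For anyone curious, the drop-in usage looks roughly like this (requires the igbinary extension; the worker/parent split here is schematic):

// Worker side: serialize the result set with igbinary instead of json_encode
$payload = igbinary_serialize($results);   // $results is the worker's data array

// ...send $payload back to whatever started the worker...

// Parent side: turn the binary string back into the original array
$results = igbinary_unserialize($payload);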

That’s fantastic news! Cutting the RAM usage in half is a big feat. As you find additional ways (if that is still on the roadmap), please do keep us informed. I’m quite interested in what tactics may help resolve your bottlenecks (it seems to be a focus I find myself in time and time again: how to reduce bottlenecks).