Large PHP Arrays: Serialise/store Judy

Hello everyone,

Has anyone used php-Judy?

If so, how can I serialise that data so I can use it later?

I have a few arrays that I have to work with and pass to other scripts; they take 500MB as plain PHP arrays, 400MB in SplFixedArray, 10MB in php-Judy and a few MB serialized.
But I’m having problems getting the data set back out of the Judy array to serialize it (without manually looping over it and building normal objects)…
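For reference, the manual round-trip I’d like to avoid looks roughly like this (untested sketch; it assumes your php-judy build lets you foreach over the Judy object, and the sample row is made up):

// copy the Judy array into a plain PHP array just long enough to serialize it
$judy = new Judy(Judy::STRING_TO_MIXED);
$judy['foo'] = array('id' => 1, 'name' => 'bar'); // made-up sample row

$plain = array();
foreach ($judy as $key => $value) {
    $plain[$key] = $value;
}

$blob = serialize($plain); // or igbinary_serialize($plain) for a smaller payload
file_put_contents('/tmp/dataset.bin', $blob);

// later, in the other script:
$restored = unserialize(file_get_contents('/tmp/dataset.bin'));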

Has anyone worked with very large arrays in PHP and found a better solution, or know how to deal with Judy?

I run several gigs’ worth of arrays in my scripts with no problem (22k rows a second with no overhead attached). Why do you feel you need this extension and need to serialize? You’re going to take a performance hit.

Never heard of it, but if what you’re saying is true - 10MB vs 500MB in a normal PHP array - it’s definitely worth a look.
I’m skeptical because of such a huge difference, but if it’s really true then I’ll probably be using it.

Thanks for pointing this out.

I also have a few k rows (10k+), but each is a multidimensional array with string keys (objects) that gets passed from one script to another (assume serialize/unserialize). When they get passed over, they hit the PHP memory limit and bomb out (OK, I can increase that, but each request takes ~30 sec, so I can’t really give each one 2GB of RAM…)

And the main issue is this: How-big-are-PHP-arrays-really-Hint-BIG (Arrays are way bigger than the data they have to store…)

What I actually need is a way to group data (e.g. classes/structs/objects) without the overhead of creating the class (that gets way too slow).
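Something like this rough, untested sketch is the kind of thing I mean (the field constants are just made up for illustration):

// one SplFixedArray per "record", with constants standing in for the field names,
// to avoid both per-object class overhead and string keys on every row
const F_ID    = 0;
const F_NAME  = 1;
const F_PRICE = 2;

function makeRecord($id, $name, $price)
{
    $r = new SplFixedArray(3);
    $r[F_ID]    = $id;
    $r[F_NAME]  = $name;
    $r[F_PRICE] = $price;
    return $r;
}

$rec = makeRecord(42, 'widget', 9.99);
echo $rec[F_NAME]; // "widget"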

Any suggestions?

Found this mail archive: http://www.mail-archive.com/judy-devel@lists.sourceforge.net/msg00147.html

Sadly I found the same thing…
So my current solution is to keep the objects serialized in binary format and unserialize them on use…
A bit more CPU, but a ton less RAM needed.
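Roughly what I’m doing now (sketch; it needs the igbinary extension, and $records / $someId are just stand-ins for my data):

// keep only the binary blobs in RAM, one per record
$packed = array();
foreach ($records as $id => $record) {
    $packed[$id] = igbinary_serialize($record);
}
unset($records);

// ...and inflate a single record only when it's actually needed
$record = igbinary_unserialize($packed[$someId]);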

I’ll keep an eye on this thread in case someone has any better ideas.

K. Wolfe, I would be interested in your input.

What’s the nature of these scripts? Why do you need to pass the entire array? Can you localize the scripts into one?

To prevent a lot of overlap, I assumed this is still in relation to http://www.sitepoint.com/forums/showthread.php?966015-Efficient-way-to-passing-large-objects-between-scripts

Hopefully that assumption was correct, @K_Wolfe; in other words, it is the same setup: multiple servers processing 900+ responses, each containing a few megs of data, that get combined as they move up the stack.

Ohhh… Same guy, eh? I’m going to fall back to my original thought from the older thread, at least I thought I brought it up there…

You need to set up a central “application server” and then a data warehouse of sorts.

We know that you’re going to need some multi-threading in order to get those curl / soap calls completed in a timely manner (or, in this case, child processes, since PHP doesn’t have multi-threading :frowning: ). We want to 1) ultimately increase overall performance, which will follow from the rest: 2) lower network traffic and 3) less processing being done on each array.
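Until a properly threaded service is in place, plain PHP can at least fire the calls in parallel with curl_multi - a rough sketch (the URLs are made up):

// fire all requests in parallel and collect the responses
$urls = array('http://api.example.com/a', 'http://api.example.com/b');
$mh = curl_multi_init();
$handles = array();

foreach ($urls as $i => $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch);
    $handles[$i] = $ch;
}

do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh); // block until at least one handle has activity
} while ($running > 0);

$responses = array();
foreach ($handles as $i => $ch) {
    $responses[$i] = curl_multi_getcontent($ch);
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);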

I’m about to mention something that’s frowned upon by some people here, but sometimes a job calls for it; I’m currently setting up a data warehouse using it due to the sheer volume of data I’m dealing with:

I recommend setting up a central data server with MongoDB. http://www.mongodb.org/
You can store multidimensional data (arrays) in this database. There is no strict table structure; each row (document) can have its own structure, which means you need to program those restrictions into the code if they need to exist.

Say the return you get from those curl calls is XML; you’d only need the following:


//connect to mongo
$m = new Mongo();
$coll = $m->dbName->xml_responses;

// XML string -> SimpleXML -> JSON -> associative array, inserted as one document
$coll->insert(json_decode(json_encode(simplexml_load_string($curlResponse)), true));

This will have loaded your array into MongoDB. Yes, we are running a json_encode and json_decode on it, but we can drop a lot of network traffic. When all of your 900 requests have been fulfilled, you can actually QUERY your return results rather than manipulate them in memory.
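For example (a sketch using the same driver as above; the field names are made up):

// query the stored responses instead of walking giant arrays in memory
$cursor = $coll->find(
    array('order.status' => 'FAILED'),          // dot notation reaches into the subdocument
    array('order.id' => 1, 'order.status' => 1) // only pull back the fields you need
);

foreach ($cursor as $doc) {
    // each $doc comes back as a plain PHP array
}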

MongoDB is a whole new beast that isn’t covered often here at the forums; before running away from the idea, have a look at it. I’ve grown to love it (in the right situations), and one of the places it excels is those randomly structured XML / JSON structures. You can even layer some caching on top of these results. You can index anything within a subdocument as if it were a normal column (an array element is 5 levels deep? you can index it and query for it).
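For instance (sketch; the path is made up):

// index a deeply nested element with dot notation...
$coll->ensureIndex(array('order.items.sku' => 1));

// ...and then query on it like a normal column
$hits = $coll->find(array('order.items.sku' => 'ABC-123'));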

Didn’t know MongoDB could add indexes on multidimensional arrays; I’ll check it out.

I will have to run some tests, since while I only insert a few thousand records (50-100k), I need to update a few fields in all of them, and I’m not sure how fast that gets done in MongoDB.
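If I go that route, the bulk update would basically be something like this (untested sketch; the field names are made up):

// flip a couple of fields on every record in the batch in one call
$coll->update(
    array('batch_id' => $batchId),                    // criteria: all records in this batch
    array('$set' => array('status' => 'processed')),  // fields to change
    array('multiple' => true)                         // apply to all matches, not just the first
);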

Currently, I store each object twice:
Once in binary format (so I can pass it between scripts), and once in Judy format so I can work with it.
In my app, 11k records take 0.3 sec to process (updates and so on), but 1.2 sec and 30MB (instead of 185MB as normal arrays) to igbinary_unserialize and turn the data into a Judy object.
BUT, at this point I don’t need all the data, only part of it, so since I can pick what to show, the 1.2 sec turns to 0.1 and 30MB to 300KB. :slight_smile:

Keep this in mind though:

The maximum BSON document size is 16 megabytes.

This is not a huge deal; just don’t go crazy when trying to aggregate your data.

Yeah, thought I had it, but Judy does not work on Debian (live servers) even though it works on Ubuntu (dev server)…
When I iterate the keys I get junk in there on Debian, so it seems like a memory leak or something.

Can you supply a sample reply from your soap / curl call for me? I’m curious to see the format of these; this project still intrigues me.

PMed you a sample

Your PM disappeared? lol. Can you try another source? That one is timing out on me.

This one should work then:
http://nopaste.dk/p21345

Sorry, I’m starting to have a small look at this today. So this is one of the many requests your child processes will send out, then?

What does your environment look like again? How many nodes, and what are their specs?

Can you throw me a few more types of requests (don’t need the responses) and give me a quick rundown of what each field’s change does for you?

After some offline discussions, my top-level thoughts:

I would recommend loading some sample XML responses into MongoDB, then trying to pull them back out and seeing what your memory usage looks like. I really feel that a JSON-based DB is your answer here for caching and transfer.
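The check itself can be dead simple - a sketch, reusing the $m / $coll names from the earlier snippet and a made-up sample directory:

// round-trip a handful of sample responses and watch peak memory
$coll = $m->dbName->xml_responses;

foreach (glob('/path/to/samples/*.xml') as $file) {
    $coll->insert(json_decode(json_encode(simplexml_load_file($file)), true));
}

foreach ($coll->find() as $doc) {
    // stand-in for your normal processing
}

echo 'peak memory: ' . memory_get_peak_usage(true) . " bytes\n";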

On top of this, possibly Java / Python to handle your soap calls, as those have true multi-threading capability (I’m referring to a single process running as a service to handle all your requests to the external source). Threads (child processes) would be dynamically created and destroyed based on the requests given to the service, and would feed the responses back to Mongo.

The front end would wait for a completion flag within Mongo for that session ID before retrieving the data.
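On the front end that can be as simple as polling a status document (sketch; the collection and field names are made up):

// wait until the workers mark this session complete, then read the results
$status = $m->dbName->job_status;

do {
    $job = $status->findOne(array('session_id' => $sessionId, 'state' => 'complete'));
    if ($job === null) {
        usleep(250000); // nothing yet - wait 250ms and poll again (a real version would time out)
    }
} while ($job === null);

// all responses are in; safe to query the results collection now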

And if you are required to use PHP, you can create a RESTful HTTP service in PHP so you can have multi-threading (so long as you execute multiple simultaneous calls to these services). Granted, if I were writing it, the top controller would be written in a different language (C#, Java, Python, anything that has a strong multi-threading framework) so you can send out requests and manage them in a thread queue (and tell it to wait until all return before doing further processing).
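Each “thread” in the PHP-only version would just be a tiny endpoint like the sketch below (the parameter and collection names are made up, and file_get_contents stands in for the real soap/curl call), which the controller hits many times in parallel, e.g. with the curl_multi approach sketched earlier:

// worker.php - handles exactly one external call and drops the result into mongo
$sessionId = $_GET['session_id'];
$target    = $_GET['target'];

$response = file_get_contents($target); // stand-in for the real soap/curl request

$m = new Mongo();
$m->dbName->xml_responses->insert(array(
    'session_id' => $sessionId,
    'body'       => json_decode(json_encode(simplexml_load_string($response)), true),
));

echo 'ok';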

A few tutorials: