  1. #1 Vali (SitePoint Guru)

    Large PHP Arrays: Serialise/store Judy

    Hello everyone,

    Has anyone used php-Judy?

    If so, how can I serialise that data, so I can use it later?

    I have a few arrays that I have to work with and pass to other scripts. They take 500MB as plain PHP arrays, 400MB in SplFixedArray, 10MB in php-Judy, and a few MB serialized.
    But I'm having problems getting the data set in the Judy array back out to serialize it (without manually looping over it and building normal objects; that fallback is sketched below)...

    Has anyone worked with very large arrays in PHP and found a better solution? Or does anyone know how to deal with Judy?
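    For reference, the manual loop I'm trying to avoid looks roughly like this (just a sketch; I'm assuming php-judy's first()/next() traversal and ArrayAccess behave as commented):

    Code PHP:
    // Rough sketch only: walk the Judy array key by key and copy it into a
    // plain PHP array that serialize()/igbinary_serialize() can handle.
    // Assumes first()/next() traversal and that next() returns null/false
    // once it runs out of keys.
    function judy_to_array(Judy $judy)
    {
        $out = array();
        $key = $judy->first();            // first (lowest) key, if any
        while ($key !== null && $key !== false) {
            $out[$key] = $judy[$key];     // read the value via ArrayAccess
            $key = $judy->next($key);     // key after $key
        }
        return $out;
    }

    $blob = serialize(judy_to_array($judy));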

  2. #2 K. Wolfe
    Quote Originally Posted by Vali View Post
    Has anyone used php-Judy? If so, how can I serialise that data, so I can use it later? [...]
    I run several gigs worth of arrays in my scripts with no problem (22k rows a second with no overhead attached). Why do you feel you need this extension, and why serialize? You're going to take a performance hit.

  3. #3 lampcms.com (PHP Guru)
    Never heard of it, but if what you're saying is true (10MB vs 500MB for a normal PHP array), it's definitely worth a look.
    I'm skeptical because of such a huge difference, but if it's really true then I'll probably be using it.

    Thanks for pointing this out.
    My project: Open source Q&A
    (similar to StackOverflow)
    powered by php+MongoDB
    Source on github, collaborators welcome!

  4. #4 Vali (SitePoint Guru)
    Quote Originally Posted by K. Wolfe View Post
    I run several gigs worth of arrays in my scripts with no problem (22k rows a second with no overhead attached). Why do you feel you need this extension, and why serialize? You're going to take a performance hit.
    I also have a few k rows (10k+), but each is a multidimensional array with string keys (objects) that gets passed from one script to another (assume serialize/unserialise). When they get passed over, they hit the PHP memory limit and bomb out (OK, I can increase that, but each request takes ~30sec, so I can't really give each one 2GB of RAM...)

    And the main issue is this: How-big-are-PHP-arrays-really-Hint-BIG (arrays are way bigger than the data they have to store...)

    What I actually need is a way to group data (ex: classes/structures/objects) without the overhead of creating the class (that gets way too slow).

    Any suggestions?

  5. #5 cpradio (Hosting Team Leader)
    Quote Originally Posted by Vali View Post
    If so, how can I serialise that data, so I can use it later? [...]
    Found this mail archive: http://www.mail-archive.com/judy-dev.../msg00147.html

  6. #6 Vali (SitePoint Guru)
    Sadly, I found the same thing...
    So my current solution is to keep the objects serialized in binary format and unserialize them on use...
    More CPU, but a ton less RAM needed.
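    Roughly like this (igbinary is what I'm using for the binary format; the function_exists() fallback is just for illustration):

    Code PHP:
    // Rough sketch of the store-serialized / unserialize-on-use approach.
    // igbinary_serialize() produces a compact binary blob; plain
    // serialize() works too, it is just much bigger.
    $blob = function_exists('igbinary_serialize')
        ? igbinary_serialize($records)
        : serialize($records);

    // ...keep $blob around / pass it to the next script...

    $records = function_exists('igbinary_unserialize')
        ? igbinary_unserialize($blob)
        : unserialize($blob);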

    I'll keep an eye on this thread in case someone has any better ideas.

    K. Wolfe, I would be interested in your input.

  7. #7 K. Wolfe
    What's the nature of these scripts? Why do you need to pass the entire array? Can you consolidate the scripts into one?

  8. #8 cpradio (Hosting Team Leader)
    To prevent a lot of overlap: I assumed this is still in relation to http://www.sitepoint.com/forums/show...etween-scripts

    Hopefully that assumption was correct. @K. Wolfe ; in other words, it's the whole multiple-servers setup processing 900+ responses, each containing a few megs of data, that get combined together as they move up the stack.

  9. #9 K. Wolfe
    Ohhh.. same guy, eh? I'm going to fall back to my original thought from the older thread (at least I thought I brought it up there).

    You need to set up a central "application server" and then a data warehouse of sorts.

    We know that you're going to need some multi-threading to get those curl / soap calls completed in a timely manner (or, in this case, child processes, since PHP doesn't have multi-threading). We want to 1) ultimately increase overall performance, which implies the rest: 2) lower network traffic and 3) less processing done on each array.

    I'm about to mention something that's frowned upon by some people here, but some jobs call for it. I'm currently setting up a data warehouse with it myself, due to the sheer volume of data I'm dealing with:

    I recommend setting up a central data server with MongoDB. http://www.mongodb.org/
    You can store multidimensional data (arrays) in this database. There is no strict table structure; each row can have its own structure, which means you need to program those restrictions into the code, if they need to exist.

    Say the return you get from those curl calls is XML; you'd only need the following:

    Code PHP:
    // connect to MongoDB (legacy Mongo driver)
    $m = new Mongo();
    $coll = $m->dbName->xml_responses;

    // SimpleXML object -> JSON -> associative array, inserted as one document
    $xml = simplexml_load_string($curlResponse);
    $coll->insert(json_decode(json_encode($xml), true));

    This will have loaded your array into MongoDB. Yes, we are running a json_encode and json_decode on it, but we drop a lot of network traffic. And when all 900 of your requests have been fulfilled, you can actually QUERY your results, rather than manipulating them in memory.

    MongoDB is a whole new beast that isn't covered often here at the forums; before running away from the idea, have a look at it. I've grown to love it (in the right situations), and one of the places it excels is those randomly structured XML / JSON documents. You can even layer some caching on top of these results. And you can index anything within a subdocument as if it were a normal column (an array element is 5 levels deep? you can index it and query for it).
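    As a rough illustration (the field names here are made up; the calls are the legacy driver's ensureIndex()/find()):

    Code PHP:
    // index a nested field, however deep, using dot notation
    $coll->ensureIndex(array('response.flights.price' => 1));

    // then query and project instead of filtering big arrays in PHP
    $cursor = $coll->find(
        array('response.status' => 'OK'),       // criteria on a nested field
        array('response.flights.price' => 1)    // fetch only what you need
    );
    foreach ($cursor as $doc) {
        // each $doc comes back as a plain PHP array
    }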

  10. #10 Vali (SitePoint Guru)
    I did not know MongoDB could add indexes on multidimensional arrays; I'll check it out.

    I will have to run some tests, since while I only insert a few tens of thousands of records (50-100k), I need to update a few fields in all of them, and I'm not sure how fast that gets done in MongoDB (the kind of update I mean is sketched below).

    Currently, I store each object twice:
    once in binary format (so I can pass it between scripts), and once in Judy format so I can work with it.
    In my app, 11k records take 0.3sec to process (updates and so on), but 1.2sec and 30MB (instead of 185MB for normal arrays) to igbinary_unserialize and turn into Judy objects.
    BUT at this point I don't need all the data, only part of it, so since I can pick what to show, the 1.2sec turns into 0.1 and the 30MB into 300KB.
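    If I read the MongoDB docs right, that bulk update would look something like this (collection and field names made up):

    Code PHP:
    // update a few fields across all matching documents in one call;
    // 'multiple' => true makes it hit every match, not just the first
    $coll->update(
        array('session_id' => $sessionId),
        array('$set' => array('price' => $newPrice, 'checked' => true)),
        array('multiple' => true)
    );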

  11. #11 K. Wolfe
    Quote Originally Posted by Vali View Post
    Currently, I store each object twice: once in binary format (so I can pass it between scripts), and once in Judy format so I can work with it. [...]
    Keep this in mind though:

    The maximum BSON document size is 16 megabytes.

    • The maximum document size helps ensure that a single document cannot use excessive amount of RAM or, during transmission, excessive amount of bandwidth. To store documents larger than the maximum size, MongoDB provides the GridFS API. See mongofiles and the documentation for your driver for more information about GridFS.

    This is not a huge deal; just don't go crazy when trying to aggregate your data.

  12. #12 Vali (SitePoint Guru)
    Yeah, I thought I had it, but Judy does not work on Debian GNU (the live servers), although it works on Ubuntu (the dev server)...
    When I iterate the keys I get junk in there on Debian, so it seems like a memory leak or something.

  13. #13 K. Wolfe
    Can you supply a sample reply from your soap / curl call for me? I'm curious to see the format of these; this project still intrigues me.

  14. #14 Vali (SitePoint Guru)
    Quote Originally Posted by K. Wolfe View Post
    Can you supply a sample reply from your soap / curl call for me? [...]
    PMed you a sample

  15. #15 K. Wolfe
    Your PM disappeared? lol. Can you try another source? That one is timing out on me.

  16. #16 Vali (SitePoint Guru)
    This one should work then:
    http://nopaste.dk/p21345

  17. #17 K. Wolfe
    Sorry, I'm only starting to have a small look at this today. So this is one of the many requests your children will send out, then?

    What does your environment look like again? How many nodes, and what are their specs?

  18. #18 K. Wolfe
    Can you throw me a few more types of requests (I don't need the responses) and give me a quick rundown of what changing each field does for you?

  19. #19 K. Wolfe
    After some offline discussions, my top-level thoughts:

    I would recommend loading some sample XML responses into MongoDB and then trying to pull them out; see what your memory looks like through there. I really feel that a JSON-based db is your answer here for caching and transfer.

    On top of this, possibly Java / Python to handle your soap calls, as these have true multi-threading capability (I'm referring to a single process running as a service to handle all your requests to the external source). Threads (child processes) would be dynamically created and destroyed based on the requests given to it, and would feed the responses back to Mongo.

    The front end would wait for a completion flag within mongo for that session id before retrieving the data.
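    Something along these lines on the front end (a rough sketch; the collection and flag names are made up):

    Code PHP:
    // poll Mongo until the workers set a completion flag for this session,
    // with a hard deadline so the front end cannot hang forever
    $m    = new Mongo();
    $jobs = $m->dbName->jobs;
    $deadline = time() + 30;

    do {
        $job = $jobs->findOne(array('session_id' => $sessionId),
                              array('done' => 1));
        if (!empty($job['done'])) {
            break;                 // all responses are in; safe to query
        }
        usleep(250000);            // wait 250ms between polls
    } while (time() < $deadline);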

  20. #20 cpradio (Hosting Team Leader)
    Quote Originally Posted by K. Wolfe View Post
    On top of this, possibly Java / Python to handle your soap calls, as these have true multi-threading capability [...]
    And if you are required to use PHP, you can create a RESTful HTTP service in PHP so you can have multi-threading (so long as you execute multiple simultaneous calls to these services; see the curl_multi sketch after the links). Granted, if I were writing it, the top controller would be written in a different language (C#, Java, Python; anything that has a strong multi-threading framework) so you can send out requests and manage them in a thread queue (and tell it to wait until all return before doing further processing).

    A few tutorials:
    http://phpmaster.com/writing-a-restf...ice-with-slim/
    http://www.9lessons.info/2012/05/cre...pi-in-php.html
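    If the top controller does have to stay in PHP, curl_multi at least gets you the "multiple simultaneous calls" part. A rough sketch (the worker URLs are made up):

    Code PHP:
    // fire several requests at the REST workers concurrently via curl_multi
    $urls = array('http://worker/job/1', 'http://worker/job/2');

    $mh = curl_multi_init();
    $handles = array();
    foreach ($urls as $i => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_multi_add_handle($mh, $ch);
        $handles[$i] = $ch;
    }

    // drive all transfers until every one has finished
    do {
        curl_multi_exec($mh, $running);
        curl_multi_select($mh);          // block until there is activity
    } while ($running > 0);

    $results = array();
    foreach ($handles as $i => $ch) {
        $results[$i] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);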

  21. #21 K. Wolfe
    Quote Originally Posted by cpradio View Post
    And if you are required to use PHP, you can create a RESTful HTTP service in PHP so you can have multi-threading (so long as you execute multiple simultaneous calls to these services). [...]
    I actually had some in-depth conversations at work today about emulating multi-threading (task forking) within PHP. It was pretty interesting, to say the least.

  22. #22 cpradio (Hosting Team Leader)
    Quote Originally Posted by K. Wolfe View Post
    I actually had some in-depth conversations at work today about emulating multi-threading (task forking) within PHP. [...]
    There are ways of doing it; I just haven't found any that are as clean to implement as .NET's (the Parallel framework is amazing). I'm not familiar enough with Java and Python to say whether they have anything similar or as easy.

  23. #23 Vali (SitePoint Guru)
    My issue is not starting up the threads; I can handle about half a million of them per minute with 3 really old servers (each thread takes ~30 sec to finish).

    My problem is being able to hold the data returned by them in memory; with normal PHP arrays I need 18 times the memory I should need...
    So I'm looking for a solution that I can use just like PHP arrays (I need to loop over those records a couple of times, update some of their fields, group them up, and show them to the user).

    I was reading up on MongoDB, but I don't have a good plan for it yet... since I multiply the scraped results by whatever the client rules are, ending up with multiple results (1*N), with MongoDB I would have to make a ton of inserts/gets from the same thread instance, so a ton of json_encode/decode and so on for nothing...

  24. #24 K. Wolfe
    Quote Originally Posted by Vali View Post
    My problem is being able to hold the data returned by them in memory; with normal PHP arrays I need 18 times the memory I should need...
    Well, you've already heard this from me and one other: something's going wrong here; this isn't correct.

    But as I said before, a JSON-based db for staging results would allow you to get that data out of memory (at least out of your application side).

    Quote Originally Posted by Vali View Post
    so a ton of json_encode/decode and so on for nothing...
    Who said you have to use json_encode / json_decode? I think this is part of why your app is breaking down; you want to avoid this. The result is returned as an array, except for a few Mongo objects inside (dates, ids); you CAN use json_encode and json_decode together to convert those from objects to arrays, but it's not necessary.
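    For example, roughly (legacy driver; the field names are made up):

    Code PHP:
    // findOne() already returns a plain PHP array with the legacy driver;
    // only the special types (MongoId, MongoDate) come back as objects
    $doc = $coll->findOne(array('_id' => new MongoId($id)));

    $when = $doc['created'] instanceof MongoDate
        ? date('c', $doc['created']->sec)   // MongoDate carries a unix timestamp
        : $doc['created'];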

    I'm a little disappointed that throughout the two threads where we've been discussing this, we haven't been able to see ANY of your PHP code to check whether you're duplicating effort somewhere and creating a memory problem.

  25. #25 Vali (SitePoint Guru)
    Quote Originally Posted by K. Wolfe View Post
    I'm a little disappointed that throughout the two threads where we've been discussing this, we haven't been able to see ANY of your PHP code [...]
    The PHP side of the code is 192MB; it's not a few lines of code, so it's a bit too complex to show here... that's why I just described the overall picture.

    And when you add a multidimensional array to MongoDB, doesn't it store it in binary JSON (BSON) format? So wouldn't it have to serialise it anyway?

