  1. #26
    SitePoint Guru bronze trophy
    Join Date
    Dec 2003
    Location
    Poland
    Posts
    930
    Mentioned
    7 Post(s)
    Tagged
    0 Thread(s)
    This has been an amazing read. I'm impressed by the magnitude of processing in this system - it makes my head spin when I try to imagine it.

    From my short test I found that json_encode() is indeed faster than serialize(), but also that when the size of the array doubles, the time spent in json_encode() more than doubles - which suggests that the larger the input, the worse the performance gets.
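
    To give an idea, the kind of micro-benchmark I ran looks roughly like this (just a sketch - the nested data shape is made up, so your numbers will differ):

    PHP Code:
    <?php
    // Rough timing sketch: json_encode() vs serialize() while the input doubles.
    // The element structure below is invented purely for illustration.
    $data = array_fill(0, 1000, array('id' => 1, 'name' => 'test', 'price' => 9.99));

    for ($i = 0; $i < 5; $i++) {
        $start = microtime(true);
        json_encode($data);
        $jsonTime = microtime(true) - $start;

        $start = microtime(true);
        serialize($data);
        $serTime = microtime(true) - $start;

        printf("%7d elements: json_encode %.4fs, serialize %.4fs\n",
               count($data), $jsonTime, $serTime);

        $data = array_merge($data, $data); // double the input for the next round
    }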

    I don't see how processing such large amounts of data can be done efficiently - the JSON parser has to do its work and there are no shortcuts. Two solutions come to my mind:

    1) Write a PHP extension in C that does the serialization and unserialization. Your extension has the potential to be faster than the built-in functions because you most probably don't need support for every data type, so the code could be stripped down to suit exactly your needs.

    2) This is just speculation, because I don't have enough knowledge of PHP internals, but I think the best solution would be to get rid of the serializing/encoding/decoding steps altogether. If all the servers exchanging data run PHP scripts, then PHP must be storing those arrays in some internal (binary?) format in memory while the script is running. If you were somehow able to grab that internal representation of the array and pass it directly to another server, which could inject it into its own running script, you would save a lot of unnecessary processing of the data back and forth. This would at least require a PHP extension, or even some digging into and patching of the PHP source to make your own custom PHP build - but if the project is of such a large scope, maybe it would be worth investigating such options.

  2. #27
    Hosting Team Leader silver trophy bronze trophy
    cpradio
    Join Date
    Jun 2002
    Location
    Ohio
    Posts
    5,240
    Mentioned
    155 Post(s)
    Tagged
    0 Thread(s)
    Another aspect: if you are processing all 900 responses, you may be holding on to memory that is no longer necessary and could be freed up (either through garbage collection, or by disposing of it within your application -- set the variables to null or use unset()).
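
    For instance, something along these lines (just a sketch - the variable and function names are made up, and it assumes you loop over the responses one at a time):

    PHP Code:
    <?php
    // Sketch: decode each of the ~900 responses, use it, then release it so only
    // one decoded payload sits in memory at a time. Names here are hypothetical.
    foreach ($responses as $key => $rawJson) {
        $decoded = json_decode($rawJson, true);

        process_response($decoded);   // whatever your aggregation step is (assumed)

        unset($decoded);              // let the engine reclaim the decoded array
        unset($responses[$key]);      // and drop the raw string once it is consumed
    }

    gc_collect_cycles();              // optionally force a collection pass afterwards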

    Another test that would be interesting is json_decode($data, true) versus json_decode($data, false). The former returns an associative array, the latter an object.
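
    A quick way to check whether that choice matters for your payloads (a rough sketch - memory_get_usage() is a crude yardstick, and response.json is just a placeholder for one of your replies):

    PHP Code:
    <?php
    $rawJson = file_get_contents('response.json'); // placeholder for one real reply

    $before  = memory_get_usage();
    $asArray = json_decode($rawJson, true);        // associative arrays
    printf("as array:  %d bytes\n", memory_get_usage() - $before);
    unset($asArray);

    $before   = memory_get_usage();
    $asObject = json_decode($rawJson, false);      // stdClass objects
    printf("as object: %d bytes\n", memory_get_usage() - $before);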

    Then, along the lines of what Lemon Juice was suggesting, you may want to consider HipHop for PHP, which takes your PHP code and transforms it into C++ (there is also HPHPi, which does all of this at run time, but to my knowledge it hasn't reached production readiness yet):
    http://en.wikipedia.org/wiki/HipHop_for_PHP

    Lastly, is all of the processing happening through web protocols, with PHP running through Apache, or via the PHP CLI? You may get some benefit from switching to the PHP CLI, but I'm not certain of that.

  3. #28
    Always A Novice bronze trophy
    K. Wolfe
    Join Date
    Nov 2003
    Location
    Columbus, OH
    Posts
    2,182
    Mentioned
    67 Post(s)
    Tagged
    2 Thread(s)
    Crap. I had some stuff come up. But what I was going to get to was: Are any of the returns from the remote sources reusable?

  4. #29
    Always A Novice bronze trophy
    K. Wolfe
    Join Date
    Nov 2003
    Location
    Columbus, OH
    Posts
    2,182
    Mentioned
    67 Post(s)
    Tagged
    2 Thread(s)
    Is that format the only possible call you can make? From the speed you're getting on the calls, it might be possible to start building up a better cache of data on a nightly basis. I'm working on a similar project using MongoDB, which allows storing multidimensional data (JSON) quickly and painlessly, and it has much better select performance than relational databases (MySQL, MSSQL).
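
    With the mongo extension (MongoClient), storing and re-reading a nightly result is only a few lines - this is just a sketch, and the database, collection and field names are invented:

    PHP Code:
    <?php
    // Sketch: cache one decoded search result as a single MongoDB document.
    $mongo = new MongoClient();              // assumes the default localhost:27017
    $cache = $mongo->flightcache->searches;

    $cache->insert(array(
        'route'   => 'LHR-JFK',              // hypothetical cache key
        'fetched' => new MongoDate(),
        'results' => json_decode($rawJson, true),
    ));

    // Later, pull the cached document straight back out by its key.
    $hit = $cache->findOne(array('route' => 'LHR-JFK'));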

  5. #30
    SitePoint Zealot 2ndmouse
    Join Date
    Jan 2007
    Location
    West London
    Posts
    196
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Very interesting indeed. I happen to work for a company that provides logistics software for about 40 airlines around the world, each of which has its own particular requirements. We are currently trying to move all our customers from the current front-ended mainframe system to a 'modern', internet-based system.

    We're using Java and an Oracle DB on the new system, which appears to cope well with this type of bottleneck. Most of the transactions are in real time - I'm not sure of the fine detail (I'm not a Java programmer), but I know the system copes very well with continuous complex requests, with an average of 800 users online at any one time, 24/7.

    Not that I'm suggesting you should move to Java - it's obviously too late for that. I'm simply empathizing with your dilemma. Airline requirements can be a pain, whatever area you're working in.

    Does Vali's last post mean he has found a solution?

  6. #31
    Hosting Team Leader silver trophy bronze trophy
    cpradio
    Join Date
    Jun 2002
    Location
    Ohio
    Posts
    5,240
    Mentioned
    155 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by 2ndmouse View Post
    Does Vali's last post mean he has found a solution?
    No, I think he was just letting me know why they chose JSON responses after considering some of the recommendations I made. JSON isn't as heavy as XML, but it is still bulkier than plain text layouts; I understand that finding the line between maintainability, usability and speed is difficult. I think we've given several ideas here, the biggest being to profile the process and truly discover where the issue lies and whether it could be the result of other items (in my tests, generating the data was far slower than running json_encode). However, reading the JSON was definitely a potential bottleneck spot too.

    The question becomes: which part is taking longer on his system? Is there a way to speed up the generation of the JSON object (cut 2-3 seconds there and you are making good progress already!)? Tackling the reading will be far more difficult, as there isn't much you can do to speed up the JSON parsing (unless using the optional true/false parameter makes any significant difference -- I'm not sure it will).
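
    A crude way to answer that is to time each stage separately, something like this (only a sketch - build_response_data() stands in for whatever generates the payload on his side):

    PHP Code:
    <?php
    // Hypothetical timing wrapper around the three stages of one request.
    $t0 = microtime(true);
    $payload = build_response_data();        // assumed name for the generation step
    $t1 = microtime(true);
    $json = json_encode($payload);
    $t2 = microtime(true);
    $copy = json_decode($json, true);
    $t3 = microtime(true);

    printf("generate: %.3fs  encode: %.3fs  decode: %.3fs\n",
           $t1 - $t0, $t2 - $t1, $t3 - $t2);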

  7. #32
    SitePoint Guru
    Join Date
    Jun 2006
    Posts
    638
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Hey guys,
    Thanks for all the suggestions.

    No, I have not yet found a solution for this, but I have about 20 ideas that came from various replies to try out.
    (that should take some time, but I will post what I find here, just to satisfy some curiosity)

    For json_encode/json_decode vs serialize/unserialize:
    json_encode is faster than serialize, and json_decode is slower than unserialize.
    But with my average data set, JSON encode+decode is faster than serialize+unserialize.

    cpradio:
    My point was: if there is anything in the returned JSON data for each reply that you do not use, you are loading it into memory during json_decode and then ditching it. But maybe that isn't what you are doing here; maybe you are filtering out whole JSON results from some of the 900 replies. Can you elaborate on that process?
    I am loading all the data for each worker into memory and then ditching what I don't need, but since that part runs sequentially on the master/parent/controller (whatever you want to call it) script, it only loads one response into memory at a time (not all 900), which is why I'm not hitting such a big memory wall.

    As for the bulk inserts into a database/cache/whatever: that would mean I need a central system where I send the data and read it back from, and either way this data needs to be serialized/unserialized somehow (and when I had it set up like that, it maxed out the network).

    As for hitting the cache more, the problem is that it requires more servers than I have.
    There are two approaches here: fill up the cache (let's say nightly) with all the possible searches (resulting in a faster first search and more cache space needed) vs. fill up the cache with the searches as they happen (resulting in a slower first search, throwing out old searches as I need more cache space).
    When you calculate how many possible ways there are to get from point A to point B, you can't really cache everything (last time I did that I got a number with 36 digits just for the outbound possibilities, when using VIA points/different connection times).

    I will get back to this thread with whatever solutions I end up testing and the results I get.

  8. #33
    Always A Novice bronze trophy
    K. Wolfe
    Join Date
    Nov 2003
    Location
    Columbus, OH
    Posts
    2,182
    Mentioned
    67 Post(s)
    Tagged
    2 Thread(s)
    Quote Originally Posted by Vali View Post
    vs. fill up the cache with the searches as they happen (resulting in a slower first search, throwing out old searches as I need more cache space).
    When you calculate how many possible ways there are to get from point A to point B, you can't really cache everything (last time I did that I got a number with 36 digits just for the outbound possibilities, when using VIA points/different connection times)
    Is it not possible to pull all flights (30-50k a day by my calculation) and calculate the flight paths on your end?

    EDIT: Never mind, that would be for US flights only.

  9. #34
    Always A Novice bronze trophy
    K. Wolfe
    Join Date
    Nov 2003
    Location
    Columbus, OH
    Posts
    2,182
    Mentioned
    67 Post(s)
    Tagged
    2 Thread(s)
    More thoughts: what version of PHP are you running? Do you have garbage collection enabled? When you are manipulating the JSON / arrays, are you duplicating the arrays in a way that might cause some extra memory usage?

    Code PHP:
    $bigArray = array(); // your return result from the remote server

    $object2 = $bigArray; // a second variable holding the same data (a copy, not a reference)

    Obviously you're not just going to copy an array straight up like that, but you may be altering it in some way and moving the altered result to a new variable, if that makes sense.

  10. #35
    Hosting Team Leader silver trophy bronze trophy
    cpradio
    Join Date
    Jun 2002
    Location
    Ohio
    Posts
    5,240
    Mentioned
    155 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by K. Wolfe View Post
    More thoughts: what version of PHP are you running? Do you have garbage collection enabled? When you are manipulating the JSON / arrays, are you duplicating the arrays in a way that might cause some extra memory usage?

    Code PHP:
    $bigArray = array(); // your return result from the remote server

    $object2 = $bigArray; // a second variable holding the same data (a copy, not a reference)

    Obviously you're not just going to copy an array straight up like that, but you may be altering it in some way and moving the altered result to a new variable, if that makes sense.
    That's actually a really good question. At first I thought arrays might be assigned by reference (they aren't -- unlike in some other languages).
    PHP Code:
    $smallArray = array(1, 2, 3, 4, 5, 6, 7);
    $newArray = $smallArray;
    array_pop($smallArray);

    var_dump($smallArray, $newArray);
    Output
    Code:
    array(6) {
      [0]=>
      int(1)
      [1]=>
      int(2)
      [2]=>
      int(3)
      [3]=>
      int(4)
      [4]=>
      int(5)
      [5]=>
      int(6)
    }
    array(7) {
      [0]=>
      int(1)
      [1]=>
      int(2)
      [2]=>
      int(3)
      [3]=>
      int(4)
      [4]=>
      int(5)
      [5]=>
      int(6)
      [6]=>
      int(7)
    }
    However, forcing by reference
    PHP Code:
    $smallArray = array(1, 2, 3, 4, 5, 6, 7);
    $newArray = &$smallArray;
    array_pop($smallArray);

    var_dump($smallArray, $newArray);
    Output
    Code:
    array(6) {
      [0]=>
      int(1)
      [1]=>
      int(2)
      [2]=>
      int(3)
      [3]=>
      int(4)
      [4]=>
      int(5)
      [5]=>
      int(6)
    }
    array(6) {
      [0]=>
      int(1)
      [1]=>
      int(2)
      [2]=>
      int(3)
      [3]=>
      int(4)
      [4]=>
      int(5)
      [5]=>
      int(6)
    }

  11. #36
    I solve practical problems. bronze trophy
    Michael Morris
    Join Date
    Jan 2008
    Location
    Knoxville TN
    Posts
    2,053
    Mentioned
    66 Post(s)
    Tagged
    0 Thread(s)
    @Vali Two thoughts occur to me. First, K. Wolfe might be onto something here - significant gains in large-object manipulation efficiency have been made in the PHP engine over the last several major releases. Second, if you aren't averse to giving other languages a shot at this problem, the worker section of it might be better suited to Node.js than PHP, especially since you are building and exporting JavaScript objects, and Node is server-side JavaScript built on Chrome's JS engine.

  12. #37
    SitePoint Guru
    Join Date
    Jun 2006
    Posts
    638
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    I'm using PHP 5.4.8, and I know about its performance improvements.

    Michael Morris, I was actually looking into Node.js, but I was not sure how to link all the business rules to it... without remaking them in JS, that is...

  13. #38
    Floridiot joebert
    Join Date
    Mar 2004
    Location
    Kenneth City, FL
    Posts
    823
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    There have to be some behavioral / popularity / business patterns you can find in the history of what people search for that would dramatically reduce that 36-digit number of pre-cache possibilities. Maybe there are certain airlines or combinations that are more profitable than others that you could give priority to; maybe there are certain combinations that aren't profitable enough to return a result for, and you could just return some alternative service to those visitors.

  14. #39
    SitePoint Guru
    Join Date
    Jun 2006
    Posts
    638
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Just a quick update for those interested.
    I'm still looking for a better way to do this, but I found something:

    igbinary_serialize / igbinary_unserialize

    It can be used exactly like serialize/unserialize, and I'm using it instead of json_encode/json_decode.
    It creates much smaller strings (faster network transfer), uses about half the RAM, and is twice as fast.

    It's a long way from the final solution, but for now it does the job.
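
    For anyone curious, it really was close to a drop-in swap - roughly like this (pack_payload/unpack_payload are just made-up wrapper names, with a fallback in case the extension isn't loaded):

    PHP Code:
    <?php
    // Sketch: wrap the serializer so igbinary is used when available.
    function pack_payload($data) {
        return function_exists('igbinary_serialize')
            ? igbinary_serialize($data)
            : json_encode($data);
    }

    function unpack_payload($blob) {
        return function_exists('igbinary_unserialize')
            ? igbinary_unserialize($blob)
            : json_decode($blob, true);
    }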

  15. #40
    Hosting Team Leader silver trophy bronze trophy
    cpradio
    Join Date
    Jun 2002
    Location
    Ohio
    Posts
    5,240
    Mentioned
    155 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by Vali View Post
    Just a quick update for those interested.
    I'm still looking for a better way to do this, but I found something:

    igbinary_serialize / igbinary_unserialize

    It can be used exactly like serialize/unserialize, and I'm using it instead of json_encode/json_decode.
    It creates much smaller strings (faster network transfer), uses about half the RAM, and is twice as fast.

    It's a long way from the final solution, but for now it does the job.
    That's fantastic news! Dropping RAM usage by half is a big feat. As you find additional approaches (if that is still on the roadmap), please do keep us informed. I'm quite interested in what tactics may help resolve your bottlenecks (these days it seems to be a focus I find myself on time and time again: how to reduce bottlenecks).

  16. #41
    SitePoint Guru
    Join Date
    Jun 2006
    Posts
    638
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    I've found that most of the time it's the overall logic of the system that gives the biggest improvements.
    Even though this system is pretty huge, I literally had to change just those two functions and gained about a 3x speed improvement - but in reality this runs on N servers and at N levels (an API calls an API that calls another API, etc...)

    If I find something better, I'll post here.

    But for now, the next bottleneck is XML parsing; it can take up to 30% of the page run time (and I get that XML from another system, so I can't change it).
    I'm using simplexml_load_string / xpath to load only the data I need, but it's still slow and the code is messy...
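
    Roughly what that parsing step looks like at the moment (a simplified sketch - the element and attribute names are invented for illustration):

    PHP Code:
    <?php
    // Sketch: load the whole reply, then xpath out only the nodes we care about.
    $xml = simplexml_load_string($rawXml);
    $segments = array();

    if ($xml !== false) {
        foreach ($xml->xpath('//segment[@status="available"]') as $seg) {
            $segments[] = array(
                'from'  => (string) $seg->origin,
                'to'    => (string) $seg->destination,
                'price' => (float)  $seg->price,
            );
        }
    }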

  17. #42
    Hosting Team Leader silver trophy bronze trophy
    cpradio
    Join Date
    Jun 2002
    Location
    Ohio
    Posts
    5,240
    Mentioned
    155 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by Vali View Post
    I've found that most of the time it's the overall logic of the system that gives the biggest improvements.
    Even though this system is pretty huge, I literally had to change just those two functions and gained about a 3x speed improvement - but in reality this runs on N servers and at N levels (an API calls an API that calls another API, etc...)

    If I find something better, I'll post here.

    But for now, the next bottleneck is XML parsing; it can take up to 30% of the page run time (and I get that XML from another system, so I can't change it).
    I'm using simplexml_load_string / xpath to load only the data I need, but it's still slow and the code is messy...
    Well, that is encouraging - trading one bottleneck for another of a different type is good progress. I'm working with an XML bottleneck myself, but the primary issue has been the retrieval of the XML file (it comes from a third-party site), not the processing (but I digress - my XML file is likely far smaller than yours at this point, roughly 1000-1500 elements).

  18. #43
    SitePoint Guru
    Join Date
    Jun 2006
    Posts
    638
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Some of mine are not that big either, but I have a lot of them.
    Remember, each search involves ~900 of these XML requests, and since they are a wrapper around an old-school terminal, they come with a "next page" token... (so 1 search = ~900 * [page count])

    I get a small page (10-500KB), then sometimes I have to do a few next-page requests to get the rest of the data... (so keeping the connection alive and reusing it to avoid 1000 handshakes speeds things up)
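
    The keep-alive part just means reusing one curl handle for the whole page chain, roughly like this (a sketch - the URL and the next-page helper are made up):

    PHP Code:
    <?php
    // Sketch: one curl handle for all pages of one request, so the TCP connection
    // is reused instead of handshaking again for every "next page".
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

    $pages = array();
    $token = null;
    do {
        $url = 'https://api.example.com/terminal?search=ABC'   // hypothetical endpoint
             . ($token !== null ? '&page=' . urlencode($token) : '');
        curl_setopt($ch, CURLOPT_URL, $url);

        $body    = curl_exec($ch);               // same handle => connection stays open
        $pages[] = $body;

        $token = extract_next_page_token($body); // made-up helper for the token
    } while ($token !== null);

    curl_close($ch);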

  19. #44
    Hosting Team Leader silver trophy bronze trophy
    cpradio
    Join Date
    Jun 2002
    Location
    Ohio
    Posts
    5,240
    Mentioned
    155 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by Vali View Post
    Some of mine are not that big either, but I have a lot of them.
    Remember, each search involves ~900 of these XML requests, and since they are a wrapper around an old-school terminal, they come with a "next page" token... (so 1 search = ~900 * [page count])

    I get a small page (10-500KB), then sometimes I have to do a few next-page requests to get the rest of the data... (so keeping the connection alive and reusing it to avoid 1000 handshakes speeds things up)
    Yeah, I keep forgetting you have 900 responses. If you ever want us to help optimize your XML parsing, just let us know - more than willing to give it a look, but I find that xpath or simplexml_load_string typically does a fine job in this area.

