
Thread: Increase speed?

  1. #1
    SitePoint Enthusiast
    Join Date
    Apr 2012
    Posts
    70
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    Increase speed?

    I have a script that builds an array of about 150 IDs, then uses file_get_contents() to fetch the data associated with each ID. I have also tried cURL, and the two are the same speed.

    I am trying to conjure up a way to make the following code faster. I'm considering switching the DB to InnoDB so I can run multiple instances at once, but it seems like there's probably a better way. Ideas?

    PHP Code:
    $Array = array(12345);
    foreach ($Array as $ID) {
        $file = file_get_contents('http://www.example.com/index.php?id=' . $ID);
        // INSERT data into database
    }

  2. #2
    Keeper of the SFL StarLion's Avatar
    Join Date
    Feb 2006
    Location
    Atlanta, GA, USA
    Posts
    3,748
    Mentioned
    73 Post(s)
    Tagged
    0 Thread(s)
    Standard Disclaimer Question: Do you have permission to be screen-scraping this data and storing it?
    Never grow up. The instant you do, you lose all ability to imagine great things, for fear of reality crashing in.

  3. #3
    Hosting Team Leader
    cpradio's Avatar
    Join Date
    Jun 2002
    Location
    Ohio
    Posts
    5,235
    Mentioned
    154 Post(s)
    Tagged
    0 Thread(s)
    After contemplating the answer to StarLion's question, feel free to give this thread a read.

  4. #4
    SitePoint Enthusiast
    Join Date
    Apr 2012
    Posts
    70
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    It's public, industrial data that requires no login. No images or wording, only numbers and dates. I asked one department for a copy of the database, which doesn't change, instead of sending millions of hits to their server, and they just said "Just send the requests to the server individually." Ok, fine by me lol.


    Honestly, for my purposes, that thread didn't really have any ideas. I did retest cURL though, and it is a sliver faster than file_get_contents(). The majority of the time is spent sending requests and waiting for the file contents, so I'm thinking multiple instances could speed it up big time. I could have a cron script run 10 different instances for different sets of updates.

  5. #5
    SitePoint Evangelist captainccs's Avatar
    Join Date
    Mar 2004
    Location
    Caracas, Venezuela
    Posts
    516
    Mentioned
    2 Post(s)
    Tagged
    0 Thread(s)
    Your script will be limited by two factors, neither of which you have control over (that I know of): the source server's latency and your connection's bandwidth.
    Denny Schlesinger
    web services

  6. #6
    @php.net Salathe's Avatar
    Join Date
    Dec 2004
    Location
    Edinburgh
    Posts
    1,397
    Mentioned
    65 Post(s)
    Tagged
    0 Thread(s)
    The problem looks to be that you're downloading the remote content serially, i.e. one at a time. You have to wait for the first to have finished before starting on the second, and so on. The key here is to make the requests in parallel: all of them (or in chunks) at the same time. That way the total time taken is (optimally) only the time of the single slowest request.

    This can be done in various ways. You could spawn many instances of your script at the same time, each fetching from one URL only. There is also cURL's "multi" interface, which allows sending off and receiving many cURL requests simultaneously. It's a bit of a faff, but a good starting point is the curl_multi_exec() PHP manual page.

    If you're going to be making ~150 requests pretty much simultaneously, you had better make sure that the content provider is really OK with it. That said, there's no reason why you can't artificially "slow" the requests (only send N requests at once, for example) and still be much faster than getting the URLs one at a time.
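    A minimal sketch of the "multi" interface described above, reusing the URL pattern from the first post (the IDs and batch size here are illustrative, not the real ones):

    ```php
    <?php
    // Sketch of cURL's "multi" interface: start a batch of requests at once
    // and collect the results once they have all completed. The URL pattern
    // matches the first post; the IDs are illustrative.
    $ids = array(101, 102, 103);
    $mh  = curl_multi_init();
    $handles = array();

    foreach ($ids as $id) {
        $ch = curl_init('http://www.example.com/index.php?id=' . $id);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_multi_add_handle($mh, $ch);
        $handles[$id] = $ch;
    }

    // Drive all transfers until every handle has finished.
    $running = null;
    do {
        curl_multi_exec($mh, $running);
        if ($running > 0 && curl_multi_select($mh) === -1) {
            usleep(100000); // select failed; back off briefly
        }
    } while ($running > 0);

    foreach ($handles as $id => $ch) {
        $body = curl_multi_getcontent($ch);
        // INSERT $body into the database here
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    ```

    To throttle as suggested, split the full ID list into chunks (e.g. with array_chunk()) and run one batch like this per chunk.
    
    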
    Salathe
    Software Developer and PHP Manual Author.

  7. #7
    SitePoint Evangelist captainccs's Avatar
    Join Date
    Mar 2004
    Location
    Caracas, Venezuela
    Posts
    516
    Mentioned
    2 Post(s)
    Tagged
    0 Thread(s)
    Parallel vs. serial is a good point, but PHP wasn't designed to work in parallel (that I'm aware of). Another solution is to get the data asynchronously, say the night before with a cron job, and then run the main script against "local" data. One advantage of having local data is that until the source file is updated there is no need to download it again.
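    The cron-plus-local-data idea could look something like this; the cache path, ID, and one-day lifetime are illustrative:

    ```php
    <?php
    // Sketch of the "local data" idea: re-download a page only when the
    // cached copy is missing or older than a day. Paths and ID are
    // illustrative stand-ins.
    $id        = 12345;
    $cacheFile = sys_get_temp_dir() . '/well_' . $id . '.html';
    $maxAge    = 86400; // one day, in seconds

    if (!file_exists($cacheFile) || filemtime($cacheFile) < time() - $maxAge) {
        // Cache is stale or absent: fetch a fresh copy and store it locally.
        $data = file_get_contents('http://www.example.com/index.php?id=' . $id);
        file_put_contents($cacheFile, $data);
    } else {
        // Cache is fresh: no network request needed at all.
        $data = file_get_contents($cacheFile);
    }
    ```
    
    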
    Denny Schlesinger
    web services

  8. #8
    Always A Novice
    K. Wolfe's Avatar
    Join Date
    Nov 2003
    Location
    Columbus, OH
    Posts
    2,182
    Mentioned
    67 Post(s)
    Tagged
    2 Thread(s)
    Quote Originally Posted by captainccs View Post
    Parallel vs. serial is a good point but php wasn't designed to work in parallel (that I'm aware of).
    You can run multiple "processes" in PHP, but not true threading. I've written a class to handle "multi-processing" that extends a shared memory segment, but it's not completely done yet (the shared memory portion). The downside is that this is for CLI work only; Apache doesn't really support it.
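    A CLI-only sketch of the multi-process approach: split the IDs across several child processes, each fetching its own slice serially. This assumes the pcntl extension is available; the ID range and worker count are illustrative.

    ```php
    <?php
    // Fork N worker processes, each fetching its own slice of IDs.
    // Requires the pcntl extension (CLI only); numbers are illustrative.
    $allIds  = range(1, 150);
    $workers = 10;
    $slices  = array_chunk($allIds, (int) ceil(count($allIds) / $workers));

    $pids = array();
    foreach ($slices as $slice) {
        $pid = pcntl_fork();
        if ($pid === -1) {
            die("fork failed\n");
        } elseif ($pid === 0) {
            // Child process: fetch this slice serially, then exit.
            foreach ($slice as $id) {
                $data = file_get_contents('http://www.example.com/index.php?id=' . $id);
                // INSERT $data into the database here
            }
            exit(0);
        }
        $pids[] = $pid; // parent records each child's PID
    }

    // Parent waits for every child to finish before moving on.
    foreach ($pids as $pid) {
        pcntl_waitpid($pid, $status);
    }
    ```

    Since the children share nothing, each one needs its own database connection; that isolation is exactly the communication gap the shared-memory class above is meant to fill.
    
    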

  9. #9
    SitePoint Enthusiast
    Join Date
    Apr 2012
    Posts
    70
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Well, both you guys obviously know more than me. The script spends 90% of its time waiting for data from the remote site. My server resources aren't being exhausted because I can't speed up the connection time. I figured I could run multiple instances and use more of those resources, but if it's not designed to do that then I'd just be wasting my time building it lol.

    My server has 16GB memory and quad 3.4GHz CPUs, and the script is using like 3% of that.

  10. #10
    SitePoint Wizard Cups's Avatar
    Join Date
    Oct 2006
    Location
    France, deep rural.
    Posts
    6,869
    Mentioned
    17 Post(s)
    Tagged
    1 Thread(s)
    What is the frequency of this data fetch?

    Daily, hourly, on page load?

  11. #11
    SitePoint Evangelist captainccs's Avatar
    Join Date
    Mar 2004
    Location
    Caracas, Venezuela
    Posts
    516
    Mentioned
    2 Post(s)
    Tagged
    0 Thread(s)
    So much to learn...
    Command Line Interface and PHP5 Multithreading:
    If you have a bottleneck in the database or network connection then you can speed up your script by up to 1000% just by implementing PHP5 multithreading. For example, you may spend 10 seconds just to establish the HTTP connection when fopening a remote page, and just 1 second to retrieve the content. If you need to fopen 1000 pages one by one then you will spend 10*1000 + 1*1000 = 11000 seconds (just over 3 hours)! If you run 100 threads then you will spend (10*1000 + 1*1000)/100 = 110 seconds (less than 2 minutes!). Obviously, you will need a powerful enough CPU, enough memory and network bandwidth.
    PHP CLI
    Denny Schlesinger
    web services

  12. #12
    SitePoint Enthusiast
    Join Date
    Apr 2012
    Posts
    70
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Hah well I'm not a professional programmer. I'm only building tools I need. CLI is definitely going to be some learning for me, but I'm sure I'll get it done. Thanks Denny!


    Cups, it's fetching state industrial information on about 20 thousand wells. I'd like to run it weekly, since I have other websites I'm gathering data from that I want to spread out through the week and run at night.

  13. #13
    SitePoint Enthusiast
    Join Date
    Apr 2012
    Posts
    70
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Ouch, I just read "Unlike the CGI SAPI, CLI writes no headers to the output by default". If this is the case, then it wouldn't work since headers are required in some cases.

  14. #14
    Always A Novice
    K. Wolfe's Avatar
    Join Date
    Nov 2003
    Location
    Columbus, OH
    Posts
    2,182
    Mentioned
    67 Post(s)
    Tagged
    2 Thread(s)
    Quote Originally Posted by captainccs View Post
    So much to learn...

    PHP CLI
    This is incorrect. There is no "threading" in PHP. You can, however, fork a new "process", which may mimic threading to those who don't know the difference. Have a look here: http://stackoverflow.com/questions/1...cess-vs-thread. That said, forking a new process is almost as beneficial as threading; what you lose is communication between your "threads" (hence my extension of a shared memory segment).

    Note that Apache also has things to say about pcntl_fork(): it can have unwanted results. What those results are I'm unsure, as I have not had a reason to fork anything through an Apache request; it has all been CLI work for me so far.
    We need to know a little bit more about these requests. Cups is on the right track with frequency, and I'm also curious about payload size.

  15. #15
    SitePoint Enthusiast
    Join Date
    Apr 2012
    Posts
    70
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Payload size is about 400KB.

  16. #16
    SitePoint Guru
    Join Date
    Nov 2003
    Location
    Huntsville AL
    Posts
    701
    Mentioned
    4 Post(s)
    Tagged
    1 Thread(s)
    Seems like you should start with a simple test.

    Run the script from the first post from the command line. Don't do anything with the data, just bring it down. Now run several copies of the script from the command line at the same time. You'll probably want each copy to bring down a different set of random IDs.

    That will tell you right away if running something in parallel will help. It could be that the example.com server itself is the bottleneck.
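    One way to script that test is to launch the copies from PHP itself and time the whole batch. Here "fetch.php" and its offset argument are hypothetical stand-ins for the real script:

    ```php
    <?php
    // Rough parallelism test: launch several copies of a fetch script at
    // once and time the batch. "fetch.php" and its offset argument are
    // hypothetical stand-ins for the real script.
    $start = microtime(true);

    $procs = array();
    foreach (array(0, 30, 60, 90, 120) as $offset) {
        // popen() starts each copy immediately, so all five run concurrently.
        $procs[] = popen('php fetch.php ' . $offset, 'r');
    }

    // Reading each pipe to EOF waits for that copy to finish.
    foreach ($procs as $p) {
        stream_get_contents($p);
        pclose($p);
    }

    printf("Elapsed: %.1f seconds\n", microtime(true) - $start);
    ```

    Compare that elapsed time against a single copy fetching everything; if they're close, the example.com server, not your script, is the bottleneck.
    
    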

  17. #17
    @php.net Salathe's Avatar
    Join Date
    Dec 2004
    Location
    Edinburgh
    Posts
    1,397
    Mentioned
    65 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by biglittle View Post
    but if you say it's not designed to do that then I just waste all my time building it lol.
    PHP is designed to allow what I suggested: it is not misusing PHP at all. If you follow one, or both, of the suggestions that I made, let us know how you get on.
    Salathe
    Software Developer and PHP Manual Author.

  18. #18
    SitePoint Enthusiast
    Join Date
    Apr 2012
    Posts
    70
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Salathe, I'm going to look into what you suggested. I've been too busy to write up a test for this though. If it works (which it looks like it should) it would be the easiest way to implement by far.

    I'm definitely not going to bombard the remote server with requests, since I don't want them to implement a way to stop me. The page is typically 300KB-1MB (all files), so I don't think it would be a big issue to send 25 requests at once. I have about 8k requests total on each site, and they've never had a problem with me sending a request about every second. I run it at night too.

    Thanks for the idea Salathe.

