Asynchronous PHP? OK… maybe misleading, but I need HELP speeding up a script!

I know PHP is, by nature, not asynchronous… so here is my situation:
I currently have a CRON job that I want to run daily. It is a PHP script that cURLs multiple pages based on db entries. After adding a few lines to measure each entry’s time to complete, I figure it will take over 1.5 days for the script to finish each time it runs… and that will only get longer.

I am trying to figure out a way to speed this up significantly… I was looking at pcntl_fork(), but I am not sure how/if I can implement it for my script. I am also not sure if it would speed it up enough to matter.

I am looking for any ideas on how I can speed this up and complete the job in MUCH less time. My webhost does not seem to support PCNTL, so forking does not appear to be an option. I do know my script works; it is just a matter of speeding it up. I will even combine several methods if they are compatible. Forking the process for each ID looks like it would help some, but I am not positive.

So here is the basic concept behind my script. There is no information printed out on a webpage. Some code is omitted to simplify the view… the concept of the script is intact.

$Email = new Email;
$Characters = new Characters;
$dbUtilities = new dbUtil;
$mycURL = new mycURL;

$db = $dbUtilities->dbConnect();  //Connects to the Database
$CharacterIDList = $Characters->Get_Character_List($db, $Email);  //Gets a list of all the CharacterIDs in the Character_Info Table

foreach ($CharacterIDList as $CharacterID){
        //cURL's character webpage and stores in $info variable
        $info = $mycURL->getWebPage('http://www.xxxxxxxxxxxxxx=Chr'.$CharacterID);

        //Scrape and update Character_Info table
        $Characters->Scrape_Character_Info($CharacterID, $info, $dbUtilities, $db, $Email);

        //Scrape and Update Character Name if Changed
        $Characters->Scrape_Character_Name($CharacterID, $info, $dbUtilities, $db, $Email);

        //Scrape and Update Character friends
        $Characters->Scrape_Character_Friends($CharacterID, $info, $dbUtilities, $db, $Email);

        //Scrape and Update Character enemies
        $Characters->Scrape_Character_Enemy_Players($CharacterID, $info, $dbUtilities, $db, $Email);
        $Characters->Scrape_Character_Enemy_Kingdoms($CharacterID, $info, $dbUtilities, $db, $Email);

        //Scrape and update Character City if changed
        $Characters->Scrape_City($CharacterID, $info, $dbUtilities, $db, $Email);
}

As far as forking is concerned… I have been testing my script by loading it in a webpage and watching it go with a temporary echo after each iteration of the foreach loop. I tried a simple forking script just to see what happened, and I got a fatal error. I do not know if this would change if it were executed as a CRON job. The test I ran was the following:

if (!function_exists('pcntl_fork')) die('PCNTL functions not available on this PHP installation');

for ($i = 1; $i <= 5; ++$i) {
    $pid = pcntl_fork();

    switch ($pid) {
        case -1:
            print "Could not fork!\n";
            exit;
        case 0:
            print "In child $i\n";
            exit;
        default:
            print "In parent!\n";
    }
}

phpinfo() revealed that I have PHP version 5.2.14 installed. Beyond that, I am a noob with CRON and have never messed with PCNTL.

Can anyone offer help to this poor soul?

Splitting your script into more sub-scripts might make it manageable.

instead of

$CharacterIDList = $Characters->Get_Character_List($db, $Email);  //Gets a list of all the CharacterIDs in the Character_Info Table

get a list of characterIDs that haven’t been scraped within a certain time period (e.g. 12 hours), with a limit of 1000 at a time.

Then do your scrapes of those 1000 and mark them as scraped with NOW().
Cron your script to run every 20 mins, so within roughly 16 hours it would have checked 50k characterIDs.

Change the times to what works for you. Maybe only check 500 every 5 mins or whatever allows your script to execute successfully.
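A minimal sketch of that batching approach, assuming a `Last_Scraped` column on the `Character_Info` table (the column name and connection details are assumptions, not from the original script, and MySQLi is used here):

```php
<?php
// Sketch only: Character_Info/Last_Scraped and the credentials are assumptions.
$db = new mysqli('localhost', 'user', 'pass', 'game');

// Grab up to 1000 characters not scraped in the last 12 hours,
// oldest scrapes first so nobody starves.
$result = $db->query(
    "SELECT CharacterID FROM Character_Info
     WHERE Last_Scraped IS NULL OR Last_Scraped < NOW() - INTERVAL 12 HOUR
     ORDER BY Last_Scraped ASC
     LIMIT 1000"
);

$mark = $db->prepare(
    "UPDATE Character_Info SET Last_Scraped = NOW() WHERE CharacterID = ?"
);

while ($row = $result->fetch_assoc()) {
    $id = (int) $row['CharacterID'];

    // ... run the existing scrape calls for $id here ...

    $mark->bind_param('i', $id);
    $mark->execute();  // mark as scraped so the next cron run skips this ID
}
$mark->close();
$db->close();
```

Each 20-minute cron run then picks up wherever the last one left off, with no state kept outside the database.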

Also, instead of checking EVERY characterID all the time, limit yourself to only scraping users who logged in recently, or, if a scrape is a duplicate of the last one X times in a row, push that ID to a longer period between scrapes. The less data you have to scrape, the easier it will be. Plus I’m sure the site you are retrieving data from would appreciate not being hit for all users all the time.

There’s no reason it should be that slow.

Profile your code to find the bottlenecks (probably the scrape functions).
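A quick way to profile without any extra tooling is to wrap each scrape call in `microtime()` timestamps. A minimal sketch (the `usleep()` is a stand-in for one of the scrape calls):

```php
<?php
// Sketch: time each stage of the loop to find the bottleneck.
$timings = array();

$start = microtime(true);
// $Characters->Scrape_Character_Friends(...);  // the call being measured
usleep(50000);                                  // stand-in work (~0.05 s) for this sketch
$timings['friends'] = microtime(true) - $start;

printf("Friend scrape: %.3f s\n", $timings['friends']);
```

Summing per-stage maxima and averages over a full run, as in the stats posted below, points straight at the expensive stage.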

The scrape functions simply run preg_match and then MySQL queries. I will look into finding a bottleneck.

The foreach loop seems to take an average of 3 seconds per iteration… but with 50k+ iterations, that is a long time.


The preg_ functions are slow. Also the regular expression patterns you’re using might be very slow or inefficient.

It might be faster to use DOMDocument loadHTML() to scrape the HTML - depends what you’re doing.
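A minimal sketch of the DOMDocument approach; the HTML structure here (a table of friends) is an invented example, since the real page layout isn’t shown in the thread:

```php
<?php
// Sketch: pull values out of scraped HTML with DOMDocument + XPath
// instead of preg_match. The markup below is a made-up example.
$html = '<html><body><table id="friends">
<tr><td>Alice</td></tr>
<tr><td>Bob</td></tr>
</table></body></html>';

$doc = new DOMDocument();
@$doc->loadHTML($html);   // @ suppresses warnings from sloppy real-world HTML

$xpath = new DOMXPath($doc);
$names = array();
foreach ($xpath->query('//table[@id="friends"]//td') as $cell) {
    $names[] = trim($cell->textContent);
}
// $names is now array('Alice', 'Bob')
```

The document is parsed once, and each value is then a cheap XPath lookup rather than a fresh regex pass over the whole page.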

Using MySQL Improved (MySQLi) could speed things up by reducing how much data you need to send to the server, e.g. by using prepared statements and/or bound parameters. You could also send multiple SQL statements at once (although if one row fails they all fail).
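For example, a single prepared statement can be reused across the whole loop. A sketch, with the connection details, table, and columns as assumptions:

```php
<?php
// Sketch: prepare once, execute many times with MySQLi.
// Credentials, Character_Info, Character_Name, CharacterID are assumptions.
$db = new mysqli('localhost', 'user', 'pass', 'game');

// Hypothetical data scraped earlier: CharacterID => new name
$updates = array(101 => 'Alice', 102 => 'Bob');

$stmt = $db->prepare(
    "UPDATE Character_Info SET Character_Name = ? WHERE CharacterID = ?"
);

// The statement is parsed by the server once; each execute() only
// sends the bound values, not a full SQL string.
foreach ($updates as $characterID => $name) {
    $stmt->bind_param('si', $name, $characterID);
    $stmt->execute();
}
$stmt->close();
$db->close();
```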

You could also batch process the cURL requests - this might help:
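A sketch of batching with `curl_multi`, so several character pages download concurrently instead of one after another (the URL list and timeout are placeholders):

```php
<?php
// Sketch: fetch a batch of pages in parallel with curl_multi.
// Callers pass an array of URLs; the masked URL from the script above
// would be built per CharacterID.
function fetchPages(array $urls) {
    $mh = curl_multi_init();
    $handles = array();
    foreach ($urls as $key => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 30);
        curl_multi_add_handle($mh, $ch);
        $handles[$key] = $ch;
    }

    // Drive all transfers until every handle has finished.
    $running = null;
    do {
        curl_multi_exec($mh, $running);
        curl_multi_select($mh);   // wait for socket activity, don't busy-loop
    } while ($running > 0);

    $pages = array();
    foreach ($handles as $key => $ch) {
        $pages[$key] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    return $pages;
}
```

A batch of, say, 10 pages then takes roughly the time of the slowest single request rather than the sum of all 10.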

I ran a few tests to determine if there was a bottleneck. If the player only has a few enemies or friends, it runs fast enough… when a player has many friends and enemies, there comes the preg_match bottleneck.

I will have to see if I can use DOMDocument instead and see if it runs a bit faster.

On a test of 50 records I found out:

Script Start Time: June 10, 2011, 12:46 pm EDT
Script End Time: June 10, 2011, 1:04 pm EDT
Script Execution Time: 1046.124
Max Cycle Time: 125.507
Avg Cycle Time: 21.349

Max cURL Time: 2.211
Avg cURL Time: 0.81

Max Info Scrape Time: 0.007
Avg Info Scrape Time: 0.002

Max Name Scrape Time: 0.054
Avg Name Scrape Time: 0.029

Max Friend Scrape Time: 43.538
Avg Friend Scrape Time: 8.825

Max Player Enemy Scrape Time: 109.548
Avg Player Enemy Scrape Time: 11.674

Max Kingdom Enemy Scrape Time: 0.015
Avg Kingdom Enemy Scrape Time: 0.003

Max City Scrape Time: 0.119
Avg City Scrape Time: 0.005

That could help quite a bit. I will look into doing that. Should not be too hard to change over to this.

There is not an easy way to determine the Character’s last log on, or if anything changed since last logged on. It is possible they lost a friend while logged off, and the info would have changed. I have contacted the developer of the site and asked if they would work in cooperation with me to minimize the bandwidth usage by granting me read-only access to their database instead of having to scrape the webpages. Hopefully that will pan out as it would be the best option all around. Meanwhile, onto trying to conquer this issue…just in case. If nothing else, I will learn a thing or two.

I ran another test. I isolated the foreach loop to run the same CharacterID 50 times. Using the CharacterID that produced the Player Enemy scrape time of 109 seconds.

Script Execution Time: 75.214
Max Cycle Time: 9.205
Avg Cycle Time: 1.504

Max cURL Time: 2.129
Avg cURL Time: 1.155

Max Info Scrape Time: 0.004
Avg Info Scrape Time: 0.003

Max Name Scrape Time: 0.086
Avg Name Scrape Time: 0.032

Max Friend Scrape Time: 0
Avg Friend Scrape Time: 0

Max Player Enemy Scrape Time: 8
Avg Player Enemy Scrape Time: 0.305

Max Kingdom Enemy Scrape Time: 0.016
Avg Kingdom Enemy Scrape Time: 0.008

Max City Scrape Time: 0.001
Avg City Scrape Time: 0

So running the one that caused the 109-second scrape time now produced a max of 8 seconds and an average of 0.305 seconds. This leads me to believe the bottleneck is not going to be solved by coding it differently. Simply server lag? Or maybe I do need to go to MySQLi to run the SQL statements. Either that, or I am just overthinking this whole thing.

It appears that you are getting game data of some type. Are you using this data in a collective format, so that you “need” the data of all 50k users for it to be useful?
Or is the data more useful on a player-by-player basis?

I use some websites that collect data from other websites and display it for me in an easy-to-use format. ALL of them wait until I log in and then use AJAX or some other method to ‘refresh’ the content they need to display the important information for me. If they have 1,000,000 users, they don’t keep the data for all million current all the time. They only fetch it when each user logs in. This is much more efficient. I really don’t mind waiting the 15-30 seconds for it to grab the data and display it.

As we don’t know your exact use of the data it’s hard to say how you could make it more efficient.

I am getting game data. I want to track certain changes over time. The game provider does not track this themselves, but I find it important to know who a player’s friends, enemies, prior kingdoms, etc. were before deciding whether to accept them into my kingdom. The information will also be used to track their level progress over time to see if they level rapidly or tend to take their time.

So, in this case, it is not ideal to pull the information on an as-requested basis.