Web Scraper Speed-Up Tips

Hello Guys,

I am running a website that downloads game information from another site, at a rate of about one game per 6.8 seconds.

I am using the simplehtmldom package to handle the parsing, and I currently process one whole game (which turns out to be a long page) in one go.

To keep my crawler running, I have a cron job set to go off every minute, and the script loops through 10 games, which currently takes about 55 seconds.

Does anyone have any tips or suggestions on how to make this process faster? I know of previous sites that did practically the same thing and were able to collect about 4 games per second.

Looking for any suggestions.

Thanks.

You need to make obtaining and processing the content separate processes (as Cups stated). This will also allow you to identify and optimise bottlenecks.

Have one process obtain the data and save it locally if successful. In the second process, check whether that local file has been processed; if not, process it and mark it as processed.
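Roughly, the second stage could be as simple as this (just a sketch; the downloads/ directory and the .done marker files are only an assumed convention, not anything you have to use):

<?php
// Stage 2: process any downloaded pages that haven't been handled yet.
// Assumes stage 1 (the fetcher) saved raw HTML into ./downloads/.
$files = glob(__DIR__ . '/downloads/*.html');

foreach ($files as $file) {
    $marker = $file . '.done';

    // Skip files that have already been processed.
    if (file_exists($marker)) {
        continue;
    }

    $html = file_get_contents($file);

    // ... parse $html and store the game data here ...

    // Mark the file as processed so the next cron run ignores it.
    touch($marker);
}
?>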

Oh wait, I could just point file1, file2, file3, file4 at the PHP file that scrapes the page, right? ha

OK, I understand that. I have already split the work into two processes, and with a cron running every minute I can download 28 pages a minute.

What I do not understand is the curl_multi_exec code. Obviously I would use this for the fetching, since it can run more than one URL at the same time, but how would I alter the code snippet above to initiate a script to scrape the page?

Try using wget to fetch the page, then do the analysis and ripping as a completely separate operation on your own server, and only if the wget actually refreshed your copy of the page, for example.
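An untested sketch of that idea; the file paths are made up, and it compares content hashes rather than relying on wget's timestamping, but it shows the "only re-parse if the copy changed" check:

<?php
// Fetch the page with wget, then only re-parse it if the content actually changed.
$local = __DIR__ . '/pages/game123.html';  // hypothetical local copy
$url   = '[url here]';
$tmp   = $local . '.new';

$oldHash = file_exists($local) ? md5_file($local) : null;

// Download into a temporary file first.
exec('wget -q -O ' . escapeshellarg($tmp) . ' ' . escapeshellarg($url));

if (file_exists($tmp) && filesize($tmp) > 0 && md5_file($tmp) !== $oldHash) {
    // The page is new or changed: replace the local copy and flag it for parsing.
    rename($tmp, $local);
    // ... queue $local for the parsing step ...
} else {
    // Nothing new: throw the temporary copy away.
    @unlink($tmp);
}
?>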

Would rewriting the script solely using DOM parsing dramatically increase the speed?

Yes, it's within legal rights as described by their policy.

That would be my first guess also. The site doing the same thing probably isn't using SimpleHTMLDOM; perhaps it is just doing everything manually to eliminate the added bulk of a third-party library.

Would this increase speed? Is it faster to scrape a page from a database than on the fly?

Every page being scraped is unique; I check for that beforehand with an ID number.

Key word: parallelism.

Don't try to do things in sequence; fire off a whole bunch of requests and/or response processors (there are myriad ways to do this, so it's worth trying a few yourself to find what fits).

Maybe curl_multi_exec could interest you. Check out the user notes too. :)

Additionally, Kore Nordmann's NJQ (Native Job Queue) (http://github.com/kore/njq) could come in handy if you're using 5.3.

If I used this code, found in the notes:

<?php
$locations = array(
    "file1" => "[url here]",
    "file2" => "[url here]",
    "file3" => "[url here]",
    "file4" => "[url here]"
);

$mh = curl_multi_init();
$threads = null;
$c = array();
$f = array();

// Create one cURL handle per URL and stream each response straight to a file.
foreach ($locations as $name => $url)
{
    $c[$name] = curl_init($url);
    $f[$name] = fopen($name . ".xml", "w");
    curl_setopt($c[$name], CURLOPT_FILE, $f[$name]);
    curl_setopt($c[$name], CURLOPT_TIMEOUT, 600);
    curl_multi_add_handle($mh, $c[$name]);
}

$t1 = time();

// Run all the transfers in parallel until none are still active.
do
{
    curl_multi_exec($mh, $threads);

    // Wait for activity on any handle instead of spinning the CPU.
    if ($threads > 0)
    {
        curl_multi_select($mh, 1.0);
    }

    if (time() > $t1 + 2)
    {
        echo "keep-alive" . "<br/>";
        $t1 = time();
    }
}
while ($threads > 0);

// Remove and close each handle, and close the output files.
foreach ($locations as $name => $url)
{
    curl_multi_remove_handle($mh, $c[$name]);
    curl_close($c[$name]);
    fclose($f[$name]);
}
curl_multi_close($mh);

?>

Where would I add my code to process/scrape the page, in the first foreach loop?

Could you expand on this? I'm parsing the page in order, not jumping up and down, but I don't fully understand what you mean.

I've never used SimpleHTMLDOM, but perhaps look at different parsers that might be more efficient?
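For example, PHP's built-in DOM extension tends to be leaner than SimpleHTMLDOM; something along these lines (the file name and the XPath query are made up for illustration, you would target whatever markup the game page actually uses):

<?php
$dom = new DOMDocument();
// Suppress warnings about messy real-world HTML.
libxml_use_internal_errors(true);
$dom->loadHTML(file_get_contents('game.html'));  // hypothetical local copy
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Pull out just the nodes you care about, e.g. every cell in a "score" column.
foreach ($xpath->query('//td[@class="score"]') as $node) {
    echo trim($node->textContent), "\n";
}
?>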

If you just need a small portion of a page, perhaps it's not so efficient to have something parse the entire page and convert it into DOM objects.
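For instance, if you only needed one value, a plain regex grab would avoid building any DOM at all (the element and pattern below are hypothetical, just to show the idea):

<?php
$html = file_get_contents('game.html');  // hypothetical local copy

// Grab a single value without parsing the whole document into DOM objects.
if (preg_match('/<span id="final-score">\s*([\d\-]+)\s*<\/span>/', $html, $m)) {
    $finalScore = $m[1];
    echo $finalScore;
}
?>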

First of all, do you have permission from the external website owner to use their information?