Why does my script hang sometimes?

I have a problem with an RSS scraper I wrote a while back. It connects to several RSS URLs, scrapes the items and puts them in a database (MySQL).
This script runs every 10 minutes, which should be more than enough time to scrape all the RSS feeds it’s supposed to scrape (currently 78, giving it an average of 7.7 seconds per feed).

Scraping the feeds and storing the items in the database is not a problem; that part works fine. The whole script runs fine most of the time.

However, sometimes it can’t get the job done in 10 minutes, and what’s even worse, it can keep running for hours on end. When that happens the script takes MySQL down with it and the server load (on a dual-core machine) spikes to around 50. Bad.

The outline of the script is below:


<?php

// Abort the script after 10 seconds (for testing purposes)
set_time_limit(10);

// Remove the PID file when the script finishes, however it exits
function removePIDFile()
{
	file_exists('cronrunner.pid') && unlink('cronrunner.pid');
}
register_shutdown_function('removePIDFile');

// If another instance is still running, bail out immediately
if (file_exists('cronrunner.pid'))
{
	die();
}

// Create the PID file to mark this instance as running
if (($fp = fopen('cronrunner.pid', 'w+')) === false || fputs($fp, '1') === false)
{
	die();
}
fclose($fp);

// Give up on unresponsive feed hosts after 5 seconds
$timeout = 5;
ini_set('default_socket_timeout', $timeout);
ini_set('magic_quotes_gpc', false);
ini_set('magic_quotes_runtime', false);

// download and process RSS feeds here

The whole cronrunner.pid thing is to ensure that no other process scrapes the RSS feeds at the same time. I didn’t have this mechanism before, and when one instance started to hang the server load would spike to around 200 or 300 :eek:

As you can see, I have a time limit of 10 seconds imposed on the script (for testing purposes) and I’ve set default_socket_timeout to 5 seconds (I’m using fopen()).
With those set, how is it possible that the script can run longer than that? My suspicion is that one of the hosts the RSS needs to be scraped from is sometimes unreachable, which causes fopen() to “hang”.

Does that make sense? If so, how could I solve this? Would it, for example, help if I started using cURL instead of fopen()?

I do something similar, and I learned to completely separate each of the jobs.

a) get the stuff you need and store it (i.e. cache it, if you will)
b) process it, possibly only if the date on the cache says it needs doing, though this may depend on your needs

Anyhow, process b) is entirely separate from process a), because a) has so many failure points in it.

And yes, I’d just cURL the lot or even wget them, having got cron to call that for me.

Cron calls something that does a) at 00:10 and at 00:11 does b).

If the 22nd cache file has not been refreshed, either skip it or process it, depending on your strategy.

I am not entirely sure this deals with the crux of your problem, but I never have trouble with this kind of operation any longer, so I recommend it.
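
A rough sketch of what I mean in PHP - the feed list, cache directory, file names and the one-hour staleness check are all made-up examples, not a definitive implementation:

<?php
// Stage a) -- fetch_feeds.php: download each feed and cache the raw XML.
$feeds = array(
	'http://example.com/news.rss',
	'http://example.org/updates.rss',
);
$cacheDir = '/var/cache/rss';

foreach ($feeds as $i => $url) {
	// Fetch with an explicit timeout so one dead host can't stall the run
	$ctx = stream_context_create(array('http' => array('timeout' => 5)));
	$xml = @file_get_contents($url, false, $ctx);

	if ($xml !== false) {
		// Only overwrite the cache file when the download actually succeeded
		file_put_contents($cacheDir . '/feed_' . $i . '.xml', $xml);
	}
}

// Stage b) -- process_feeds.php (run a minute later by cron): parse the cached
// files and store the items in the database, never touching the network.
foreach (glob($cacheDir . '/feed_*.xml') as $file) {
	// Skip cache files that stage a) failed to refresh recently
	if (filemtime($file) < time() - 3600) {
		continue;
	}
	$rss = simplexml_load_file($file);
	// ... insert $rss->channel->item entries into MySQL here ...
}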

set_time_limit() doesn’t guarantee the real maximum script execution time. The docs say:

The set_time_limit() function and the configuration directive max_execution_time only affect the execution time of the script itself. Any time spent on activity that happens outside the execution of the script such as system calls using system(), stream operations, database queries, etc. is not included when determining the maximum time that the script has been running. This is not true on Windows where the measured time is real.

It is very likely that the time spent connecting to the other servers and transferring the data is not counted towards the execution time. The same applies to the sleep() function. Therefore, a workaround is necessary - for example, save the current time() at the beginning of the script, check it on each loop iteration (each RSS connection), and break out of the loop when the maximum time is exceeded.
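
A minimal sketch of that workaround, assuming a 10-minute budget and a $feeds array of URLs (both just example values):

<?php
// Record when the run started and enforce a wall-clock deadline ourselves,
// since set_time_limit() ignores time spent in stream operations on Linux.
$start   = time();
$maxTime = 600; // example budget: 10 minutes, matching the cron interval

$feeds = array('http://example.com/news.rss' /* , ... */);

foreach ($feeds as $url) {
	// Stop fetching further feeds once the wall-clock budget is used up
	if (time() - $start >= $maxTime) {
		break;
	}
	// ... fetch and process $url here ...
}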

However, for this to be fully successful the timeouts on fopen(), fread(), etc. also need to work reliably. I don’t know how reliable default_socket_timeout is, but as an alternative you can try file_get_contents() with stream_context_create(), or cURL.
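
For instance, something along these lines (the URL and the 5-second timeout are just example values):

<?php
// Build a stream context with an explicit HTTP timeout so a dead host
// can't keep the request hanging indefinitely.
$context = stream_context_create(array(
	'http' => array(
		'timeout' => 5, // seconds to wait before giving up on the host
	),
));

$xml = @file_get_contents('http://example.com/news.rss', false, $context);
if ($xml === false) {
	// Treat the feed as unreachable this run and move on to the next one
}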

Good point. I’d never thought of separating the different steps, but I’m starting to warm up to it!

wget! Now there’s an idea! That program has been tried and tested for so long I can be sure that if the system starts acting up again it won’t be due to wget (well, most likely anyway).

What do you mean by “22nd cache file”?

Yes it makes a lot of sense; I’ll take this route. Thank you! :slight_smile:

I actually read that one time and totally forgot about it. Thanks for pointing it out :slight_smile:

That also makes sense, but I’m going with separating downloading from processing, using wget to download, which should overcome this problem entirely if I’m not mistaken.

Yes, nowadays I only use cURL because it’s just more reliable than fopen() and you never run into the problem where hosts have allow_url_fopen (or something like that) disabled. Like I said, this is kind of a legacy system (which gives me the “I didn’t know any better back then” excuse ;))
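
For reference, a minimal cURL fetch with hard timeouts looks something like this (the URL and timeout values are just examples):

<?php
// Fetch a feed with cURL, enforcing both a connect timeout and an overall
// transfer timeout so an unreachable host can't stall the whole run.
$ch = curl_init('http://example.com/news.rss');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);    // give up connecting after 5 seconds
curl_setopt($ch, CURLOPT_TIMEOUT, 15);          // abort the whole transfer after 15 seconds
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow feed redirects

$xml = curl_exec($ch);
if ($xml === false) {
	// curl_error($ch) tells us why the fetch failed (timeout, DNS, etc.)
}
curl_close($ch);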

Thanks guys!

What do you mean by “22nd cache file”?

I just meant the nth file you go and get and then cache; sorry, that wasn’t very clear.

Ah, it makes sense now. Thanks :slight_smile: