PHP script timeout while reading large file

I’m trying to write a script to import a CSV file into my database. The CSV has almost 5 million entries. The script below works except for one thing: it times out after about 5 minutes, when it’s only in the 500,000s (usually around record 540,000). Does anyone know what could be causing this? As you can see, I have the time limit set to unlimited.

Any help appreciated.

<?php
set_time_limit(0);
ini_set('display_errors','On');
mysql_connect(connection info removed);
mysql_select_db(removed);

// Open the CSV file whose path is passed in via the query string.
$file_handle = fopen($_GET['path'], "r");
$i=0;
while ($line_of_text = fgetcsv($file_handle, 1024)) {

	// The first field looks like "1234 - Some description"; keep only the SIC code.
	$sic = explode(' - ',trim($line_of_text[0],'"'));
	$sic = $sic[0];
	$company = trim($line_of_text[1],'"');
	$address = trim($line_of_text[2],'"');
	$city = trim($line_of_text[3],'"');
	$state = trim($line_of_text[4],'"');
	$zip = trim($line_of_text[5],'"');
	$county = trim($line_of_text[6],'"');
	$url = trim($line_of_text[7],'"');
	$phone = trim($line_of_text[8],'"');

	mysql_query('INSERT INTO links (`category`,`url`,`name`,`active`,`date`,`zip`,`city`,`state`,`phone`,`address`,`src`,`from_file`)
VALUES ("'.mysql_real_escape_string($sic).'","'.mysql_real_escape_string($url).'","'.mysql_real_escape_string($company).'","true",NOW(),"'.$zip.'","'.$city.'","'.$state.'","'.$phone.'","'.mysql_real_escape_string($address).'","csv","'.$_GET['path'].'")') or die(mysql_error());

	// Print progress every 5,000 rows ($i/5000 contains no decimal point only when $i is a multiple of 5000).
	if(!strstr($i/5000,'.')){
		print 'Importing record #: '.$i.'<br />';
		flush();
		ob_flush();
	}

	$i++;
}

fclose($file_handle);

print 'Total: '.$i;

?>

I wonder if it is a config issue. Looking at the first few paragraphs of:

http://uk2.php.net/function.set-time-limit

It suggests that set_time_limit() just restarts the timeout counter and replaces the max_execution_time value from php.ini (a value of 0 means no limit at all).
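For what it’s worth, a quick way to confirm which limit is actually in effect would be something like this (just a diagnostic sketch, assuming ini_get() isn’t disabled on your host):

<?php
// Check the configured limit before and after calling set_time_limit().
echo 'max_execution_time before: ' . ini_get('max_execution_time') . "<br />";

set_time_limit(0); // 0 = no limit; this replaces max_execution_time for the current request

echo 'max_execution_time after: ' . ini_get('max_execution_time');
?>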

The default seems to be 30 seconds. Does your script stop after about 30 seconds?

No, it runs a lot longer than 30 seconds. I checked phpinfo() and the default was 30, though.

What I did to solve the problem was create a redirect that passes the current ftell() value back to the script itself (which then uses fseek() to jump to the position given by $_GET['ftell']), so that execution restarts and resumes from where it left off. Right now I have it set to redirect every 100,000 rows, and I was able to import the entire file of 4,488,248 records flawlessly.

Thanks.

Here’s my code… could be of use to someone trying to import a huge CSV…

<?php
set_time_limit(0);
ini_set('display_errors','On');
mysql_connect(removed);
mysql_select_db(removed);


// Open the CSV and, if a byte offset was passed back from a previous run, resume from it.
$file_handle = fopen($_GET['path'], "r");
if(isset($_GET['ftell'])){
	fseek($file_handle,$_GET['ftell']);
}
// $i counts rows processed in this run; $x carries the overall row count across runs.
$i=0;
if(isset($_GET['x'])){
	$x=$_GET['x'];
} else {
	$x = 0;
}
while ($line_of_text = fgetcsv($file_handle, 1024)) {

	$sic = explode(' - ',trim($line_of_text[0],'"'));
	$sic = $sic[0];
	$company = trim($line_of_text[1],'"');
	$address = trim($line_of_text[2],'"');
	$city = trim($line_of_text[3],'"');
	$state = trim($line_of_text[4],'"');
	$zip = trim($line_of_text[5],'"');
	$county = trim($line_of_text[6],'"');
	$url = trim($line_of_text[7],'"');
	// Prefix bare domains with http://
	if(!empty($url)){
		$url = 'http://'.$url;
	}
	$phone = trim($line_of_text[8],'"');
	// Skip the CSV header row.
	if($company !='COMPANY NAME'){
		mysql_query('INSERT INTO links (`category`,`url`,`name`,`active`,`date`,`zip`,`city`,`state`,`phone`,`address`,`src`,`county`)
VALUES ("'.mysql_real_escape_string($sic).'","'.mysql_real_escape_string($url).'","'.mysql_real_escape_string($company).'","true",NOW(),"'.$zip.'","'.$city.'","'.$state.'","'.$phone.'","'.mysql_real_escape_string($address).'","csv","'.$county.'")') or die(mysql_error());
	}
	// Print progress every 5,000 rows.
	if(!strstr($i/5000,'.')){
		print 'Importing record #: '.$x.'<br />';
		flush();
		ob_flush();
	}


	// After 100,000 rows, redirect back to this script with the current file offset
	// so execution restarts before any timeout and resumes where it left off.
	if($i==100000){
		print '<meta http-equiv="Refresh" content="0; url='.$_SERVER['PHP_SELF'].'?x='.$x.'&ftell='.ftell($file_handle).'&path='.$_GET['path'].'">';
		//print '<a href="'.$_SERVER['PHP_SELF'].'?x='.$x.'&ftell='.ftell($file_handle).'&path='.$_GET['path'].'">Continue importing</a>';
		exit;
	}
	$x++;
	$i++;
}

fclose($file_handle);

print 'Total: '.$x;

?>

Out of curiosity, is the version of PHP you’re executing CGI?

It says on the phpinfo page:

Server API CGI/FastCGI
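If it’s quicker than digging through the phpinfo() page, you can also check the SAPI from a script; a tiny sketch:

<?php
// php_sapi_name() reports the interface PHP is running under,
// e.g. "cgi-fcgi" for CGI/FastCGI or "apache2handler" for mod_php.
echo 'SAPI: ' . php_sapi_name();
?>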

OK, if the script expires in approximately 5 minutes, then that is the issue with your script. The CGI setting in IIS is, by default, set to time out at 300 seconds, or 5 minutes.

My guess is the script ended at 5 minutes, right?

It’s not an IIS server and I didn’t actually time it. I was just guessing. I do know that it consistently timed out around row 550,000. I could try setting ignore_user_abort, but what I have now seems to be working.

Depending on how your system is set up and how your script imports the data, a few possibilities could be the following settings:

max_input_time = 60 ; Maximum amount of time each script may spend parsing request data
memory_limit = 16M ; Maximum amount of memory a script may consume
post_max_size = 8M ; Maximum size of POST data that PHP will accept.
upload_max_filesize = 2M ; Maximum allowed size for uploaded files.

Also, if PHP is running in safe mode then set_time_limit() has no effect.
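If you want to see what that server is actually using, a quick sketch like this will dump the runtime values (assuming ini_get() is available):

<?php
// Print the runtime values of the settings that commonly kill long imports.
$directives = array('max_execution_time', 'max_input_time', 'memory_limit',
                    'post_max_size', 'upload_max_filesize', 'safe_mode');
foreach ($directives as $directive) {
	echo $directive . ' = ' . ini_get($directive) . "<br />\n";
}
?>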

The thing is, even though I have error output turned on, there isn’t any output when it times out. This leads me to think it might have something to do with the browser (Firefox). I would try it from the command line, but I don’t have shell access to this server. PHP isn’t running in safe mode either, and the setting is clearly having an effect, because the script stops well past the default 30-second max_execution_time.

You may be able to try setting the Timeout directive for Apache:

http://httpd.apache.org/docs/2.0/mod/core.html#timeout

Or even mod_fastcgi for Apache:

http://fastcgi.com/

Looking at your stats above, the script would be processing roughly 1,833 records per second if it dies around row 550,000 after a 300-second timeout?

I assume you managed to insert the data with your “hack”. If you need to insert data like this in the future, it’s much easier to use the proper mechanism for loading large data sets into the database.

For MySQL, look up the LOAD DATA INFILE syntax:
http://dev.mysql.com/doc/refman/5.0/en/load-data.html

Basically, you load the data straight into the database; then, if the data needs to be altered, you loop over it and create a temp table with the altered rows (you need a reference point to the corresponding rows in the real table you are updating), and finally you update the real table from the temp table.
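For example, here is a very rough sketch of that approach against the `links` table from the earlier script. The staging table name (`links_staging`), the file path, and the assumption that the server allows LOAD DATA LOCAL INFILE are all placeholders, so adjust to taste:

<?php
// Sketch only: bulk-load the raw CSV into a staging table in one statement,
// then copy it into the real `links` table with the tweaks the earlier script
// did in PHP (splitting the SIC field, prefixing URLs, skipping the header row).
// Assumes a links_staging table with matching VARCHAR columns already exists.
mysql_query("
	LOAD DATA LOCAL INFILE '/path/to/data.csv'
	INTO TABLE links_staging
	FIELDS TERMINATED BY ',' ENCLOSED BY '\"'
	LINES TERMINATED BY '\\n'
	(sic, company, address, city, state, zip, county, url, phone)
") or die(mysql_error());

mysql_query("
	INSERT INTO links (`category`,`url`,`name`,`active`,`date`,`zip`,`city`,`state`,`phone`,`address`,`src`,`county`)
	SELECT SUBSTRING_INDEX(sic, ' - ', 1),
	       IF(url = '', '', CONCAT('http://', url)),
	       company, 'true', NOW(), zip, city, state, phone, address, 'csv', county
	FROM links_staging
	WHERE company <> 'COMPANY NAME'
") or die(mysql_error());
?>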

A script doing what I mentioned above would insert and update the same data in a matter of seconds.

Now THAT is a great resource!

I see that you use a separate INSERT for each row, which is very inefficient for large data sets. Try using bulk statements; they are MUCH, MUCH faster. You could modify your script to insert multiple rows in one INSERT statement: build up your INSERT query until it’s about 500 KB and then send it to the db. You might find you can insert all your data in less than a minute (that’s just a guess).

http://dev.mysql.com/doc/refman/5.0/en/insert.html

INSERT statements that use VALUES syntax can insert multiple rows. To do this, include multiple lists of column values, each enclosed within parentheses and separated by commas. Example:

INSERT INTO tbl_name (a,b,c) VALUES(1,2,3),(4,5,6),(7,8,9);

Just be sure not to exceed the maximum query size (max_allowed_packet), which is 1 MB by default.
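For instance, the loop could batch rows roughly like this. It’s only a sketch: the column list is trimmed to three fields to keep it short, and the ~500 KB threshold is arbitrary (just keep it under max_allowed_packet):

<?php
// Sketch: same parsing loop as before, but rows are accumulated and flushed
// to the database as one multi-row INSERT per batch.
$file_handle = fopen($_GET['path'], "r");
$values = array();
$batch_bytes = 0;

while ($line_of_text = fgetcsv($file_handle, 1024)) {
	$sic     = explode(' - ', trim($line_of_text[0], '"'));
	$sic     = $sic[0];
	$company = trim($line_of_text[1], '"');
	$url     = trim($line_of_text[7], '"');

	$row = '("'.mysql_real_escape_string($sic).'","'.mysql_real_escape_string($url).'","'.mysql_real_escape_string($company).'")';
	$values[]     = $row;
	$batch_bytes += strlen($row);

	// Flush once the batch approaches ~500 KB.
	if ($batch_bytes > 500000) {
		mysql_query('INSERT INTO links (`category`,`url`,`name`) VALUES '.implode(',', $values)) or die(mysql_error());
		$values = array();
		$batch_bytes = 0;
	}
}

// Flush whatever is left over after the loop.
if ($values) {
	mysql_query('INSERT INTO links (`category`,`url`,`name`) VALUES '.implode(',', $values)) or die(mysql_error());
}

fclose($file_handle);
?>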

Also, get rid of flush() and ob_flush(); they slow down your script a lot because it has to wait for the browser to receive the output before moving on to the next row. If there are 5 million rows, I suspect you would have difficulty finishing this loop in 5 minutes even if you dropped the INSERT statements altogether!

Edit: I see now that you only flush() your output every 5,000th row. That’s better, but for such long-running scripts I would rather write progress information to an external log file than send it to the browser.
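Inside the existing loop, that could be as simple as the fragment below (the log file name is just an example, and I’ve used a modulo check instead of the strstr() trick):

	// Append progress to a log file instead of flushing HTML to the browser.
	if($i % 5000 == 0){
		file_put_contents('import-progress.log', date('H:i:s').' imported row '.$x."\n", FILE_APPEND);
	}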

I actually switched to DreamHost because they allow SSH access. The above did, however, work on GoDaddy. I have since modified the table schema, so there is more processing required than in the script above. There is now a three-table relationship: categories have to be added to the categories table, then an entry has to be added to the category map table, along with a few checks (e.g. to see whether a category already exists).
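Roughly, the extra per-row step looks like the sketch below; the table and column names (`categories`, `category_map`, `link_id` and so on) are illustrative rather than the exact schema:

<?php
// Sketch: look up the category, insert it if it doesn't exist yet,
// then record the link-to-category relationship in the map table.
// $linkId would come from mysql_insert_id() after inserting the row into `links`.
$catName = mysql_real_escape_string($sic);

$result = mysql_query('SELECT id FROM categories WHERE name = "'.$catName.'"') or die(mysql_error());
if(mysql_num_rows($result) > 0){
	$row = mysql_fetch_assoc($result);
	$categoryId = $row['id'];
} else {
	mysql_query('INSERT INTO categories (name) VALUES ("'.$catName.'")') or die(mysql_error());
	$categoryId = mysql_insert_id();
}

mysql_query('INSERT INTO category_map (link_id, category_id) VALUES ('.(int)$linkId.','.(int)$categoryId.')') or die(mysql_error());
?>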

Why not use bigdump?

I essentially copied bigdump with the second method I tried, that is, refreshing the page after a certain amount of time to prevent a timeout. It’s just that bigdump has a lot fancier JavaScript and AJAX.
