I’m trying to write a script to import a CSV file into my database. The CSV has almost 5 million entries. The script below works except for one thing… it times out after about 5 minutes when it is only in the 500 thousands (usually record 540,000). Does anyone know what could be causing this? As you can see, I have the time limit set to unlimited.
No, it runs a lot longer than 30 seconds. I checked phpinfo() and the default was 30, though.
What I did to solve the problem was create a redirect that passes the current ftell() value back to the script itself (which then uses fseek() to move to the position given by $_GET['fseek']), so that script execution is restarted and resumes from where it left off. Right now I have it set to redirect every 100,000 rows, and it imported the entire file of 4,488,248 records flawlessly.
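For anyone curious, a minimal sketch of that resume-via-redirect trick might look like this. The CSV file name and the 100,000-row batch size are illustrative assumptions, not my actual values, and the INSERT logic is elided:

```php
<?php
// Sketch of resuming a long CSV import via self-redirect.
set_time_limit(0);

$fp = fopen('import.csv', 'r');       // hypothetical file name
if (isset($_GET['fseek'])) {
    fseek($fp, (int) $_GET['fseek']); // resume where the previous pass stopped
}

$rows = 0;
while (($data = fgetcsv($fp)) !== false) {
    // ... INSERT $data into the database here ...

    if (++$rows >= 100000) {
        $pos = ftell($fp);            // byte offset of the next unread row
        fclose($fp);
        // Redirect back to this script so execution restarts with a fresh timer.
        header('Location: ' . $_SERVER['PHP_SELF'] . '?fseek=' . $pos);
        exit;
    }
}
fclose($fp);
echo 'Import complete';
```

Each redirect gives the script a fresh execution-time budget, which is why the import can run past any wall-clock limit the host enforces.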
It’s not an IIS server, and I didn’t actually time it; I was just guessing. I do know that it consistently timed out around row 550,000. I could try setting ignore_user_abort(), but what I have now seems to be working.
Depending on how your system is set up and how your script imports the data, a few possibilities could be the following settings:
max_input_time = 60 ; Maximum amount of time each script may spend parsing request data
memory_limit = 16M ; Maximum amount of memory a script may consume
post_max_size = 8M ; Maximum size of POST data that PHP will accept.
upload_max_filesize = 2M ; Maximum allowed size for uploaded files.
Also, if PHP is running in safe mode then set_time_limit() has no effect.
The thing is, even though I have error output turned on, there isn’t any output when it times out. That leads me to think it might have something to do with the browser (Firefox). I would try it from the command line, but I don’t have shell access to this server. PHP isn’t running in safe mode either, and the setting is clearly having an effect, because the script stops well after the default 30-second max_execution_time.
Basically, you insert the data into the database; then, if the data needs to be updated, you loop over it and build a temp table with the altered data (you need a reference point to the placement of the data in the real table you are updating); finally, you update the real data from the data in the temp table.
A script doing what I mentioned above would insert and update the same data in a matter of seconds.
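A rough sketch of that pattern, assuming MySQL; the connection details and the table/column names (products, import_fixes, price) are hypothetical stand-ins for whatever your real schema uses:

```php
<?php
// Sketch of the insert-then-update-via-temp-table pattern.
$db = mysql_connect('localhost', 'user', 'pass'); // hypothetical credentials
mysql_select_db('mydb', $db);

// Temp table holding the altered data, keyed on the real table's id
// (the "reference point" mentioned above).
mysql_query("CREATE TEMPORARY TABLE import_fixes (
    id INT PRIMARY KEY,
    price DECIMAL(10,2)
)", $db);

// ... bulk-insert the altered rows into import_fixes here ...

// One set-based UPDATE instead of millions of per-row UPDATEs.
mysql_query("UPDATE products p
             JOIN import_fixes f ON f.id = p.id
             SET p.price = f.price", $db);
```

The win is that the database does the matching and updating in a single statement, rather than PHP issuing one UPDATE per row.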
I see that you use a separate INSERT for each row; that is very inefficient for large data sets. Try using bulk statements, they are MUCH, MUCH faster. You could modify your script to insert multiple rows in one INSERT statement: that is, build up your INSERT query until it’s about 500KB large and then send it to the db. You might even be able to insert all your data in less than a minute (that’s just a guess).
INSERT statements that use VALUES syntax can insert multiple rows. To do this, include multiple lists of column values, each enclosed within parentheses and separated by commas. Example:
INSERT INTO tbl_name (a,b,c) VALUES(1,2,3),(4,5,6),(7,8,9);
Just be sure not to exceed the maximum query size (max_allowed_packet in MySQL), which is 1MB by default.
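A sketch of how the batching could look in PHP. The table and column names come from the example above; the size threshold and the intval() escaping (fine for integer data only) are illustrative assumptions:

```php
<?php
// Build multi-row INSERT statements, flushing each one before it
// exceeds a size limit (roughly the 500KB suggested above).
function build_batches(array $rows, $maxLen = 512000) {
    $prefix  = 'INSERT INTO tbl_name (a,b,c) VALUES ';
    $queries = array();
    $values  = '';
    foreach ($rows as $row) {
        // intval() is a stand-in escape for integer columns; use proper
        // escaping (e.g. mysql_real_escape_string) for string data.
        $chunk = '(' . implode(',', array_map('intval', $row)) . ')';
        // Close out the current query before it would exceed the limit.
        if ($values !== '' && strlen($prefix . $values . ',' . $chunk) > $maxLen) {
            $queries[] = $prefix . $values;
            $values = '';
        }
        $values .= ($values === '' ? '' : ',') . $chunk;
    }
    if ($values !== '') {
        $queries[] = $prefix . $values;
    }
    return $queries; // run each with mysql_query() instead of one per row
}
```

For example, build_batches(array(array(1,2,3), array(4,5,6))) returns a single query, `INSERT INTO tbl_name (a,b,c) VALUES (1,2,3),(4,5,6)`.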
Also, get rid of flush() and ob_flush(); they slow down your script a lot, because it has to wait for the browser to receive the output before moving on to the next row. If there are 5 million of them, I suspect you would have difficulty finishing this loop in 5 minutes even if you removed the INSERT statements altogether!
Edit: I see now that you only flush() your output every 5,000th row; that is better, but for such long-running scripts I would rather write any progress information to an external log file instead of sending it to the browser.
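Something along these lines; the log path, CSV path, and logging interval are illustrative assumptions:

```php
<?php
// Sketch: append progress to a log file instead of flushing to the browser.
$log  = fopen('import_progress.log', 'a'); // hypothetical log path
$fp   = fopen('import.csv', 'r');          // hypothetical CSV path
$rows = 0;
while (($data = fgetcsv($fp)) !== false) {
    // ... insert $data into the database here ...
    if (++$rows % 100000 == 0) {
        fwrite($log, date('Y-m-d H:i:s') . " imported $rows rows\n");
    }
}
fwrite($log, date('Y-m-d H:i:s') . " done: $rows rows total\n");
fclose($fp);
fclose($log);
```

You can then watch the import progress by reloading the log file, and the script never blocks on the browser connection.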
I actually switched to DreamHost because they allow SSH access. The above did, however, work on GoDaddy. I have since modified the table schema, so more processing is required than in the above script. There is now a three-table relationship: categories have to be added to the categories table, then an entry has to be added to the category map table, along with a few checks (e.g., to see whether a category already exists).