Hello developers --
We're coming down to the final tweaks on an application and have started to import data... lots of data... and the process, while fast, could be better [4,000-5,000 records/minute with over 10,000,000 to process]. Hoping that you all can offer up some suggestions on what to optimize.
The import consists of around 1,500 XML files, ranging from 10 MB to 1 GB each, containing a series of products and the corresponding stores selling said products.
- Gather an associative array keyed by [known] product identifiers - $bigproducts
- Loop through each category; download and extract (gunzip) the product-level file
- Loop through the products:
    - Check whether the product exists by identifier key in $bigproducts
    - If the identifier is not found, run a query on 5 indexed fields looking for the item
        - No match: insert a db record within the loop; add the key to $bigproducts
        - Match: skip and move on
    - Access the store-product record that matches each product
    - Take the product id and run an INSERT ... ON DUPLICATE KEY UPDATE
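The loop above could be sketched roughly like this. Everything here is an assumption for illustration: the table names (`products`, `store_products`), the columns, and the helper `lookupByIndexedFields()` are hypothetical stand-ins for the real code.

```php
<?php
// Sketch of the per-category loop described above. Schema and
// lookupByIndexedFields() are assumptions, not the app's real code.
function importCategory(PDO $pdo, array $products, array &$bigproducts): void
{
    $insertProduct = $pdo->prepare(
        'INSERT INTO products (identifier, name) VALUES (?, ?)'
    );
    $upsertStore = $pdo->prepare(
        'INSERT INTO store_products (product_id, store_id, price)
         VALUES (?, ?, ?)
         ON DUPLICATE KEY UPDATE price = VALUES(price)'
    );

    foreach ($products as $p) {
        $key = $p['identifier'];
        if (!isset($bigproducts[$key])) {
            // Fallback query on the 5 indexed fields (body omitted).
            $id = lookupByIndexedFields($pdo, $p);
            if ($id === null) {
                $insertProduct->execute([$key, $p['name']]);
                $id = (int) $pdo->lastInsertId();
            }
            $bigproducts[$key] = $id;
        }
        // Forced upsert of the store-product record every time.
        $upsertStore->execute([$bigproducts[$key], $p['store_id'], $p['price']]);
    }
}
```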
Of course there's a ton of secondary work in data cleanup, managing categories and such but that's the high-level logic.
All of the PHP has been tightly managed to prevent memory leaks: we see less than 100 bytes used per record, all released back through the cycle, with just a small build-up over the course of 100,000 product imports.
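For what it's worth, a per-record footprint like that can be confirmed between batches with something like the sketch below (the batch callback and size are placeholders):

```php
<?php
// Sketch: sample memory around a batch to confirm the per-record
// footprint stays flat (the ~100 bytes/record figure above).
function memoryPerRecord(callable $processBatch, int $batchSize): float
{
    gc_collect_cycles();                 // settle collectable cycles first
    $before = memory_get_usage();
    $processBatch();
    gc_collect_cycles();
    return (memory_get_usage() - $before) / $batchSize;
}
```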
As I see it, the major bottlenecks are the queries and the inserts.
Each product not in the array [which is everything in a new category] gets a query. Products that are truly new get an immediate insert. Individual store-product records don't query data but do a forced INSERT / UPDATE every time.
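One way to cut round-trips on the forced store-product upserts would be to buffer rows and flush them as a single multi-row INSERT ... ON DUPLICATE KEY UPDATE. A minimal sketch, again assuming a hypothetical `store_products` schema:

```php
<?php
// Sketch: flush buffered store-product rows in one multi-row upsert
// instead of one statement per row. Table/columns are assumptions.
function flushStoreProducts(PDO $pdo, array $rows): void
{
    if ($rows === []) {
        return;
    }
    // One "(?, ?, ?)" group per buffered row.
    $placeholders = implode(', ', array_fill(0, count($rows), '(?, ?, ?)'));
    $sql = "INSERT INTO store_products (product_id, store_id, price)
            VALUES $placeholders
            ON DUPLICATE KEY UPDATE price = VALUES(price)";
    $params = [];
    foreach ($rows as $r) {
        array_push($params, $r['product_id'], $r['store_id'], $r['price']);
    }
    $pdo->prepare($sql)->execute($params);
}
```

MySQL applies the same duplicate-key resolution per row in the multi-row form, so the upsert semantics shouldn't change, only the number of statements.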
We've considered batching the inserts, but we're talking about thousands of inserts per category, and there's also an impact on data quality from not catching duplicates.
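The duplicate concern might be addressable per batch rather than per record: collect the batch's identifiers, resolve them with one IN() query, and only insert what's still unknown. A sketch under the same hypothetical schema assumptions:

```php
<?php
// Sketch: resolve an entire batch's duplicates with one IN() query,
// so batching inserts doesn't sacrifice duplicate detection.
// Returns only the rows that are genuinely new.
function filterKnown(PDO $pdo, array $batch, array &$bigproducts): array
{
    $unknown = array_filter(
        $batch,
        fn(array $p): bool => !isset($bigproducts[$p['identifier']])
    );
    if ($unknown === []) {
        return [];
    }
    $ids = array_column($unknown, 'identifier');
    $in  = implode(',', array_fill(0, count($ids), '?'));
    $stmt = $pdo->prepare(
        "SELECT id, identifier FROM products WHERE identifier IN ($in)"
    );
    $stmt->execute($ids);
    foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
        $bigproducts[$row['identifier']] = (int) $row['id'];
    }
    // Anything still missing from $bigproducts is genuinely new.
    return array_values(array_filter(
        $unknown,
        fn(array $p): bool => !isset($bigproducts[$p['identifier']])
    ));
}
```

This trades N single-row lookups for one query per batch; the 5-field fallback lookup would still need its own handling for records with no identifier match.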
We've also considered dropping the indexes on the table during the process, but since the import actively searches the records it's adding, that also seems bad.
Any thoughts on how to kick things up to a faster state? Or on how to measure where it's actually slow?
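On the measurement side, before optimizing anything it may help to attribute wall time to each phase (download, gunzip, parse, lookup, insert). A minimal sketch, with phase names as placeholders:

```php
<?php
// Sketch: accumulate wall time per phase so the real bottleneck is
// measured, not guessed. Phase names here are placeholders.
final class PhaseTimer
{
    /** @var array<string, float> */
    private array $totals = [];

    /** Run $fn, charge its wall time to $phase, return its result. */
    public function time(string $phase, callable $fn)
    {
        $t0 = microtime(true);
        try {
            return $fn();
        } finally {
            $this->totals[$phase] = ($this->totals[$phase] ?? 0.0)
                + (microtime(true) - $t0);
        }
    }

    /** Totals in seconds, slowest phase first. */
    public function report(): array
    {
        arsort($this->totals);
        return $this->totals;
    }
}
```

Wrapping each phase call (e.g. `$timer->time('insert', fn() => $stmt->execute($params))`) and dumping `report()` every 100k records would show whether the time really is in the queries and inserts, or in the XML parsing. On the MySQL side, the slow query log and EXPLAIN on the 5-field fallback query are the usual complements.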