Data Clear Unsuccessful?

Can anyone explain to me what has happened here please?

We have a very large customers table (32.75m rows, 12GB in size) which contains a whole load of invalid data, and I was tasked with reducing its size. We need to remove customers who have no email address recorded and whose records are more than 30 days old, but running that query directly on the live server would grind it to a halt.
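In other words, the rows we're after are something like this (a rough PHP/PDO sketch, not the real script; the connection details are made up and I'm treating "no email address" as NULL or an empty string):

```php
<?php
// Rough sketch of the selection criteria (connection details are made up;
// "no email address" is treated as NULL or an empty string).
$pdo = new PDO('mysql:host=localhost;dbname=shop', 'user', 'pass');

$stmt = $pdo->query(
    "SELECT id
       FROM customers
      WHERE (email IS NULL OR email = '')
        AND dateadded < NOW() - INTERVAL 30 DAY"
);
$ids = $stmt->fetchAll(PDO::FETCH_COLUMN);
```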

So, I selected all of the rows, just the ID, email and dateadded, 1m at a time, into CSV outfiles. Using the CSV engine I ran an “INSERT ... SELECT” and imported the records that met the criteria above into a MyISAM table. When this was done (it took about 6 seconds per 1m-row import file) I exported the IDs into another file.
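Each chunk went through roughly this (again just a sketch; the table and file names here are made up):

```php
<?php
// Sketch of the per-chunk import step. The CSV engine requires NOT NULL
// columns and supports no indexes, which is why the filtered rows go
// into a separate MyISAM work table.
$pdo = new PDO('mysql:host=localhost;dbname=scratch', 'user', 'pass');

// A CSV-engine table whose .CSV data file on disk is swapped for each
// exported 1m-row chunk.
$pdo->exec("CREATE TABLE IF NOT EXISTS csv_chunk (
                id INT NOT NULL,
                email VARCHAR(255) NOT NULL,
                dateadded DATETIME NOT NULL
            ) ENGINE=CSV");

// Keep only the rows matching the purge criteria.
$pdo->exec("INSERT INTO purge_candidates (id)
            SELECT id
              FROM csv_chunk
             WHERE email = ''
               AND dateadded < NOW() - INTERVAL 30 DAY");
```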

This identified 28.3 million records which needed to be removed, but I needed an extra level of checking before deleting: since I grabbed those IDs, had any of these customer records become valid? So, the script selected all rows that now had a valid email address, and those were removed from the list to process. Any that remained were then stripped from the basket and customer tables, and at the end a log entry was appended telling me how many rows were deleted, how many per second, and how many valid customers were overlooked.
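The per-batch logic looked roughly like this (a simplified sketch, not the production script; the `baskets` table name, the `customer_id` column and the `purge_ids.csv` filename are stand-ins):

```php
<?php
// Simplified sketch of the re-check and delete step.
$pdo = new PDO('mysql:host=localhost;dbname=shop', 'user', 'pass');
$ids = array_map('intval', file('purge_ids.csv', FILE_IGNORE_NEW_LINES));

$overlooked = 0;
foreach (array_chunk($ids, 1000) as $batch) {
    $in = implode(',', $batch);

    // Anyone who has gained a valid email since the export is spared.
    $nowValid = $pdo->query(
        "SELECT id FROM customers WHERE id IN ($in) AND email <> ''"
    )->fetchAll(PDO::FETCH_COLUMN);
    $overlooked += count($nowValid);

    $toDelete = array_diff($batch, $nowValid);
    if ($toDelete) {
        $in = implode(',', $toDelete);
        $pdo->exec("DELETE FROM baskets   WHERE customer_id IN ($in)");
        $pdo->exec("DELETE FROM customers WHERE id IN ($in)");
    }
}
// ...plus the logging of rows deleted, rows/sec and $overlooked.
```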

The process took a few days to complete, on and off, and last night we ran an OPTIMIZE TABLE on the customers table, which took 2h 20min.

Now, my log file says that we removed 28,261,695 rows in 24h 47min and that 99.42% of the IDs I gave it were deleted. So we expected the count on the table to be something like 4.5m; however, it’s actually reporting 28,091,795 rows and still 8.3GB.

I’m going to pull a million rows out again and run them through the same process as before to compare the results, but if anyone could shed any light on what may have happened I would be most grateful.

Cheers

Your log file - what did you use to create it? Your own custom program / script? And how are you checking the size of the data file / number of records?

To be honest, I made some assumptions in the log file. It’s done in PHP, and it looks like those assumptions were incorrect. I’ve changed the script to echo the queries and compared them to the list of values that should have been used (basically something like “DELETE FROM customers WHERE id IN (1,2,3,4,5…)”). It turns out there was an error in my code’s logic: it broke out of its loops too early and only processed the first 100 IDs from each batch of 1000, so it removed only about 10% of the IDs we wanted it to.
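To illustrate, the mistake was along these lines (not my actual script, just a cut-down example of the kind of early break that produces the first-100-of-each-1000 behaviour):

```php
<?php
// Illustration only: a break that checks a cumulative counter instead of
// the per-chunk one, so each 1000-ID batch stops after its first
// 100-ID chunk.
$ids = range(1, 1000); // stand-in for one 1000-ID batch

$deleted = [];
foreach (array_chunk($ids, 100) as $chunk) {
    foreach ($chunk as $id) {
        $deleted[] = $id; // stand-in for the DELETE query
    }
    if (count($deleted) >= 100) {
        break; // bug: fires after the very first chunk, skipping the rest
    }
}

echo count($deleted); // 100 of 1000 - roughly the 10% I was seeing
```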

What an idiot!

Well, I wouldn’t go that far and use such words; it’s a common mistake, but you learned a lesson - always assume you’re wrong :slight_smile:

Also, if you are looking to improve performance on larger tables, I’d suggest ditching MyISAM as the storage engine, since InnoDB scales better, and reading up on partitioning in MySQL.
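For example, range-partitioning on the signup date lets you drop old customers a whole partition at a time instead of deleting rows one by one. Just a sketch, mind - MySQL requires the partitioning column in every unique key, so this assumes the primary key includes `dateadded`:

```php
<?php
// Sketch of range partitioning on the signup date. Caveat: this only
// works if `dateadded` is part of every unique key on the table,
// including the primary key.
$pdo = new PDO('mysql:host=localhost;dbname=shop', 'user', 'pass');

$pdo->exec("ALTER TABLE customers
            PARTITION BY RANGE (TO_DAYS(dateadded)) (
                PARTITION p2010 VALUES LESS THAN (TO_DAYS('2011-01-01')),
                PARTITION p2011 VALUES LESS THAN (TO_DAYS('2012-01-01')),
                PARTITION pmax  VALUES LESS THAN MAXVALUE
            )");

// Old customers can then be removed almost instantly:
// ALTER TABLE customers DROP PARTITION p2010;
```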

I’m only using MyISAM on my local machine because it’s quicker for importing the data. All of our tables on the live servers are InnoDB, but cheers :slight_smile: