Hi all,

Would appreciate some help if anyone could advise:

I have a mySQL table that records ‘log’ entries for a set of products.

I’m trying to create a query that will retain up to 20 of the most recent entries per unique product.

I have tried a few approaches but none so far seem to be the optimum.

The table is updated all the time, so there are a lot of records in there at the moment. There are approx 10,000 unique products, and lets say there are currently 100 entries each. Daily I wish to clean this up and only keep the most recent 20 records for each unique product. There is a datetime field that can be used to assess a record’s age.

If anyone can assist it would be appreciated.

Thank you!

structure of table would be very helpful

actually knowing only the PK and the datetime column, all the others can be foo, bar, etc.

Hi, thank you for the replies, please see below:

| Field            | Type          | Null | Key | Default | Extra          |
| id               | int(11)       | NO   | PRI | NULL    | auto_increment |
| productsku       | varchar(15)   | NO   | MUL | NULL    |                |
| time             | datetime      | NO   |     | NULL    |                |
| entry            | varchar(255)  | NO   |     | NULL    |                |


what about:

           , productsku
           , entry
   FROM tablename
     BY time DESC LIMIT 20 

At least if I understand you right?

Thanks donboe, yes that would select what I want to keep, but I need to delete the rows I don’t want to keep. So for clarification i’d like to delete everything except the result of the SELECT you gave - hope that makes sense?

donboe, DISTINCT is ~not~ a function, you cannot apply it to a single column like that, it applies to all columns in the SELECT list

1 Like

[quote=“Boreas, post:1, topic:97427, full:true”]I’m trying to create a query that will retain up to 20 of the most recent entries per unique product.[/quote]let’s be clear… do you want a SELECT query to return the 20 most recent entries per product, or do you want a DELETE statement that will delete all the other ones?

deleting old/actual/historic data is seldom a good idea…

Thanks for the reply. I’d like a DELETE to delete all but the most recent 20 entries for each unique product. Don’t need to return them. If not a good idea to delete what would be the alternative? The table gets an immense amount of entries per day, and the last 20 are the only relevant ones.

As i’ve been unable to figure this out the table currently has over 2.5 million rows. Thank you for helping

are you really sure you want to delete all the older ones?

that you’ll never need to go back and check something happened in the past on one of the products?

in any case, what you’ll end up doing is writing a SELECT query to pull the most recent 20 for each product

you can use this query to create a new table, then you would remove all the rows from the original table, then copy back the ones you’ve saved

it’s not going to be pretty

how often do you think you’ll want to do this?

This should work:

DELETE FROM tablename
    WHERE time < (SELECT MIN(time) FROM
        (SELECT time FROM tablename ORDER BY time DESC LIMIT 20) dt);

However, if you have millions of rows to delete I’m not sure how efficient this would be since mysql is known for poorly optimized subqueries in the WHERE clause. If you find this performs poorly then splitting the process into separate queries might help:

SET @pasttime=(SELECT MIN(time) FROM
    (SELECT time FROM tablename ORDER BY time DESC LIMIT 20) dt);
DELETE FROM tablename WHERE time < @pasttime;

In any case, it would be good to have the time column indexed.

Edit: sorry, I missed the requirement for 20 unique products so my solution is incomplete…

Thanks for the replies

Yes its something that is changing and moving so often that only last few events are relevant.

The first time is to ‘clean up’ as the system has been adding records for so long with this issue unresolved hence why there are so many records in there at the moment. I was thinking to run a cron daily to keep it prim, however if the solution you are suggesting is the only one then it sounds like I need to do that first to clean up, then change the funcs that actually do the INSERT of records to run a clean up on just the product it is writing about at the time?

Thanks for the effort Lemon_Juice and the advice on the datetime col index

latest 20 rows per product –

SELECT id , product_sku , `time` , entry FROM daTable AS t WHERE ( SELECT COUNT(*) FROM daTable WHERE product_sku = t.product_sku AND `time` > t.`time` ) < 20
create a second table, exact same layout, and stick INSERT INTO in front of this SELECT to save the ones to keep

then delete from your main table, and copy these back

reaason for doing it in two steps: mysql cannot DELETE from the same table used in a subselect

Thank you r937 for all your help. Will post again with time it took to do this first pass over, for those that may be interested.

Are you sure about that? Look at the first query in my post above - it is not doing exactly what the OP wants but it uses a subselect that uses the same table as the DELETE statement. I’m not sure about performance but it works.

You have:

  • a log table
  • updated all the time
  • with a lot of records

You want to:

  • clean up
  • keep the most recent records

It will take long to delete when a lot of records, especially if updated all the time, so you should not do this on a daily basis.

You could partition the table based on time series, you could create a view for the latest 20 or so most recent entries and you should clean up the log only when tablespace grows too large.

Thanks for replies

Yes that sums it up vion9929 , thanks for input. Interesting comment about partitioning, I hadn’t considered that. Will do an initial clean up (I’m going to do this tonight hopefully) and then decide on how best to proceed from there

With INSERT INTO table2 in front now been running over 15 mins, all crons are currently paused so there will be no updates to original table whilst this is running. Nothing in “table2” yet. There are about 3M records and about 10,000 unique product_sku’s in “table1”

Fire up an app like phpMyAdmin and have a look at table2 to confirm if anything has been written to that table

With phpMyAdmin open it was interesting to see that on the Structure view the cardinality was increasing (very slowly). A COUNT(*) on table2 returned zero. I imagine the data was still in memory and not yet written to table2. After the query running for 2 hours, there was only 10,000 rows “in” table2. Given the fact there are 10,000 products, and I was after the most recent 20 log entries, I expected to get 200,000 rows when complete.

As this was taking so long and making little progress I killed the query and switched crons back on.

After careful consideration (and knowing I have a full back up should it be required) I truncated table1 of it’s 3M rows.

So the moral here for me at least is don’t get into that situation again. I will now either implement a partition as suggested by vion or have a “clean up before write to log” to keep it from going over 20 entries per product at the time of writing entries. Not sure how thats going to sit performance wise but I don’t seem to have many options,

Thanks all for the input