How to Delete All Duplicated Records

I have a WordPress site where I’ve added functionality to show the location of users who’ve posted to the site.

I’m using Google’s geocoding API to generate lat & lng coordinates. But I must’ve coded something wrong, because I’m getting 2 or 3 sets of coordinates for every location, and a ton of duplicates of each.

I first need to clean the duplicates out of the database, but I’m unsure how to go about this. What complicates things further is how this particular WordPress table (wp_postmeta) is built.

The table columns are meta_id, post_id, meta_key and meta_value.

meta_id is the pk, post_id is the fk back to the actual post, meta_key holds the names that describe the meta_value data, like “latitude” and “longitude”, and meta_value holds the actual data, like 33.6839473.

Combining the post_id and meta_key values will provide a unique identifier. Any records with duplicate values should be removed.

If anyone could provide me some ideas or related examples it would be greatly appreciated. Thanks!

Using phpMyAdmin, this query will provide you with a list of duplicates.

select meta_id, post_id, meta_key, meta_value, count(*) from wp_postmeta
group by meta_id, post_id, meta_key, meta_value having count(*) > 1;

dear binhosting, nice try – and welcome to sitepoint

since meta_id is the pk of that table, the COUNT(*) is never going to be anything other than 1, so that query will return bupkis

the list of duplicates is provided by this query –

SELECT t.meta_id
     , t.post_id
     , t.meta_key 
     , t.meta_value
  FROM ( SELECT post_id  
              , meta_key
           FROM wp_postmeta
         GROUP
             BY post_id  
              , meta_key
         HAVING COUNT(*) > 1 ) AS dupes
INNER
  JOIN wp_postmeta AS t
    ON t.post_id  = dupes.post_id     
   AND t.meta_key = dupes.meta_key  
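
(Side note: if only the coordinate rows matter, the same list can be narrowed by filtering the subquery; the “latitude” and “longitude” key names here are assumed from the original post:)

SELECT t.meta_id
     , t.post_id
     , t.meta_key
     , t.meta_value
  FROM ( SELECT post_id
              , meta_key
           FROM wp_postmeta
          WHERE meta_key IN ( 'latitude', 'longitude' )  -- assumed key names
          GROUP
             BY post_id
              , meta_key
         HAVING COUNT(*) > 1 ) AS dupes
 INNER
  JOIN wp_postmeta AS t
    ON t.post_id  = dupes.post_id
   AND t.meta_key = dupes.meta_key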

compare the subquery called “dupes” with this statement by the original poster…

Combining the post_id and meta_key values will provide a unique identifier. Any records with duplicate values should be removed.

however, i don’t think it should be “any” rows with duplicate values that should be removed… rather, “all but one”

:)

Yep, that’s a lot of duplicates.

Thank you both for your answers. r937, you’ve come through for me again. That’s a very interesting query. Joining onto a table derived from a subquery is something I haven’t seen before.

I can now generate a list of meta_ids for my dupes. I’ve tried altering the SQL to return only the ids, then wrapping it in its own subquery and deleting everything within the list it generates.


DELETE wp_postmeta FROM wp_postmeta WHERE meta_id IN (
  SELECT t.meta_id     
  FROM ( SELECT post_id  
              , meta_key
           FROM wp_postmeta
         GROUP
             BY post_id  
              , meta_key
         HAVING COUNT(*) > 1 ) AS dupes
INNER
  JOIN wp_postmeta AS t
    ON t.post_id  = dupes.post_id     
   AND t.meta_key = dupes.meta_key
)

I get an error saying I can’t specify the target table for update in the FROM clause. I’ve found this is because MySQL won’t modify the same table that’s being used in the SELECT part of the query.

Any suggestions on how to work around this?
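
(For reference, the usual workaround for that error is to wrap the subquery in one more derived table, which forces MySQL to materialize the result into a temporary table before the DELETE runs, at least on the MySQL versions current at the time of this thread. A sketch, though note it would delete every duplicated row, keepers included:)

DELETE
  FROM wp_postmeta
 WHERE meta_id IN
     ( SELECT meta_id
         FROM ( SELECT t.meta_id
                  FROM ( SELECT post_id
                              , meta_key
                           FROM wp_postmeta
                          GROUP
                             BY post_id
                              , meta_key
                         HAVING COUNT(*) > 1 ) AS dupes
                 INNER
                  JOIN wp_postmeta AS t
                    ON t.post_id  = dupes.post_id
                   AND t.meta_key = dupes.meta_key
              ) AS x )  -- the extra wrapper "x" is what avoids the error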

before we go any further, could you please provide guidance regarding whether “all” rows that have duplicates should be deleted, or “all but one”

Sorry, the query should delete all but one entry for each post_id / meta_key pair.
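
(A minimal sketch of that: a multi-table DELETE with a self-join sidesteps the target-table restriction entirely, and keeping the lowest meta_id of each pair is one simple rule, although whether that particular row holds the correct coordinates is another matter:)

-- delete every row that has an older sibling (lower meta_id)
-- with the same post_id / meta_key combination,
-- i.e. keep only the lowest meta_id of each pair
DELETE t
  FROM wp_postmeta AS t
 INNER
  JOIN wp_postmeta AS keeper
    ON keeper.post_id  = t.post_id
   AND keeper.meta_key = t.meta_key
   AND keeper.meta_id  < t.meta_id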

Here is a simple way if you don’t care which pair is kept:


ALTER IGNORE TABLE yourtablenamegoeshere ADD UNIQUE(post_id,meta_key)
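
(Two caveats worth flagging: ALTER IGNORE TABLE is MySQL-specific and was removed in MySQL 5.7, so on newer servers you would delete the duplicates first and then add a plain unique index; and WordPress itself allows repeated meta_key values per post, so a unique index on wp_postmeta could trip up other plugins. A sketch of the newer-server route, with an arbitrary index name:)

-- dedupe first, then guard against new duplicates;
-- the index name post_key is arbitrary
ALTER TABLE wp_postmeta ADD UNIQUE KEY post_key (post_id, meta_key)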

Thanks guelphdad. I’m learning a lot in this thread. Your suggestion is tempting. The only drawback is that I have multiples of both correct and incorrect coordinates for each post. Some posts would be left with correct coordinates, while others would have the correct data deleted and the bad data retained. In either scenario I’d be left with only a single pair of coordinates, which would make things easier to administer. I’m looking over my DB right now to see if this would be a good option.

How do you mean, correct and incorrect co-ordinates? And no, that wouldn’t look after whether the co-ordinates are correct or not, but you didn’t specify that elsewhere in the thread.

While I go back and re-read your original post: it sounds like you are using an ENTITY ATTRIBUTE VALUE table structure, which isn’t a good thing. Perhaps you can give us some sample rows from your table(s) so we can assist you further, particularly to steer you AWAY from EAV if you are indeed using it, since it makes many queries so much more cumbersome than properly normalized data structures.
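
(To illustrate the cumbersomeness: with EAV, pulling even one coordinate pair per post takes a join per attribute. A sketch, assuming the standard wp_posts table and the “latitude”/“longitude” meta_key values from earlier in the thread:)

-- one join per attribute: two meta rows must be matched
-- back up just to rebuild a single logical row, and with
-- the duplicates still in place this returns one row per
-- latitude/longitude combination
SELECT p.ID
     , lat.meta_value AS latitude
     , lng.meta_value AS longitude
  FROM wp_posts AS p
 INNER
  JOIN wp_postmeta AS lat
    ON lat.post_id  = p.ID
   AND lat.meta_key = 'latitude'
 INNER
  JOIN wp_postmeta AS lng
    ON lng.post_id  = p.ID
   AND lng.meta_key = 'longitude'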

guelphdad, you’re not going to convince anybody that EAV is bad design when it’s built right into wordpress

a big outfit like that, they surely must have known what they were doing, right? ;)

anyhow, back to the dupes …

the combination of post_id and meta_key will have dupes, but the meta_value values will differ, so in order to eliminate the “incorrect” dupes, it is necessary to inspect all the dupes individually and make a human decision

therefore, the list of duplicates query in post #3 is probably as far as you can go, and deleting the incorrect dupes will have to be done by specifying the meta_id values to be deleted
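
(in other words, something along these lines, where the meta_id values are made-up placeholders for whatever the inspection turns up:)

DELETE
  FROM wp_postmeta
 WHERE meta_id IN ( 101, 102, 105 )  -- placeholder ids from the dupes list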

<off topic> Well the Maple Leafs must know what they’re doing too since they have all this money … yet the continue to employ Ron Wilson … </hockey rant>

I mentioned in my original post that I had an error in my code that was generating 2-3 sets of coordinates, and duplicates of each, for every location. That’s what I meant about incorrect data. After considering both options, I think it’s best to delete all the coordinate data, recode the Google API calls and start over.
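
(For the record, the wipe-and-restart could be as simple as the following, assuming “latitude” and “longitude” are the exact meta_key values in use:)

-- remove every stored coordinate so the corrected
-- geocoding code can repopulate from scratch
DELETE
  FROM wp_postmeta
 WHERE meta_key IN ( 'latitude', 'longitude' )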

Thank you for all the comments in this thread. I’ve learned a lot. The example of joining a table onto a subquery was great, and the ALTER IGNORE TABLE command is also new to me.

I’d also been wondering what you would call the table structure WP is using for this data. I now know it’s referred to as an Entity Attribute Value structure. It’ll be easier to research problems now that I know its name.

I agree with your dislike of the EAV structure. It makes it hard to pull together all the related data you need, and I’d imagine this table will quickly get bloated. I think they’ve designed it this way to make WP more flexible, so it can be used for different types of sites, while disregarding ease of development. I’m OK with the extra work, but I’d imagine this would also hurt site performance in the long run as the database gets populated and this EAV table records multiple rows for every single record in the posts table.

I know it’s off topic from the original question, but even with efficient SQL, would a heavily populated EAV table perform slower than a normalized table?

not necessarily

just likely

;)