Eliminating Duplicate Records

Shaun · April 22, 2010, 12:07pm

[FONT=“Georgia”]Hello.

I have a huge database with several duplicate entries.

What is the simplist query I can write to select and delete the duplicates, leaving behind only one single record?

I’m guessing I’ll need to use a GROUP BY somewhere in there, but I’m honestly baffled about the rest.

Making things trickier is that there are actually two fields I’ll need to look at in order to determine if a record is truly a duplicate or not.

Any suggestions?

[/FONT]

Shaun · April 22, 2010, 12:18pm

[FONT=“Georgia”]Just to elabourate; Each record does have a unique ID. But the rest of the data is what’s dupicate in some cases.

Example below;[/FONT]

_______________________________
| id | title       | country  |
===============================
|  1 | abrasives   | trinidad |
[COLOR="Red"]|  2 | abrasives   | trinidad |
|  3 | abrasives   | trinidad |[/COLOR]
|  4 | abrasives   | tobago   |
|  5 | accountants | trinidad |
|  6 | accountants | tobago   |
[COLOR="Red"]|  7 | accountants | tobago   |
|  8 | accountants | tobago   |
|  9 | accountants | tobago   |[/COLOR]
| 10 | actuaries   | trinidad |
[COLOR="Red"]| 11 | actuaries   | trinidad |[/COLOR]
| 12 | actuaries   | tobago   |
| 13 | agriculture | trinidad |
| 14 | agriculture | tobago   |
| etc...

[FONT=“Georgia”]Highlighted in red are what I’m trying to get rid of. Redundant data.

[/FONT]

r937 · April 22, 2010, 12:27pm

three steps:

CREATE TABLE my_temp
SELECT MIN(id) AS id, title, country FROM my_table
GROUP BY title, country

verify the results, then

TRUNCATE TABLE my_table

INSERT INTO my_table SELECT * FROM my_temp

then, to prevent the dupes from ever happening again,

ALTER TABLE my_table
ADD UNIQUE (title,country)

Vali · April 22, 2010, 8:10pm

I do this:


RENAME TABLE foo TO tmp;
CREATE TABLE foo LIKE tmp SELECT DISTINCT null,  title, country FROM tmp;
DROP TABLE tmp;

Just be careful, this will not keep the same IDs, but will be faster than the GROUP solution.

If you want to keep the IDs, I would do this:


RENAME TABLE foo TO tmp;
CREATE TABLE foo LIKE tmp;
ALTER TABLE foo ADD UNIQUE(title,country);
INSERT IGNORE INTO foo SELECT * FROM tmp ORDER BY id ASC;
DROP TABLE tmp;

But it’s slower.

Shaun · April 22, 2010, 9:58pm

[FONT=“Georgia”]Thanks Vali, but actually I was busy working on Rudy’s suggestion while you were typing yours.

I did a combination of Rudy’s query with PHPMyAdmin to create and trancate the tables.

I’m trying to figure out the last part though;[/FONT]

ALTER TABLE my_table
ADD UNIQUE (title,country)

[FONT=“Georgia”]What does that query do?

I googled it and it said something about adding a column to the table?

[/FONT]

SJH · April 22, 2010, 10:24pm

It creates a unique constraint on the columns title and country, meaning that each combination of these two fields can only exist in the table once

Shaun · April 22, 2010, 10:29pm

[FONT=“Georgia”]ah, Okay.

So how would that fit into my existing INSERT query?[/FONT]

INSERT INTO my_table VALUES ('','$title','$country','$other-values');

[FONT=“Georgia”]Or is it a query I’d run seperately, just after running INSERT?

[/FONT]

IBazz · April 22, 2010, 10:50pm

IN phpMyAdmin, go tio the mysl query window/box.

type in rudys alter table statement, making sure the name name of the table matches that of your db table. et voila The unique constraint will have been added and you won’t get any duplicates any more, where those two columns are concerned.

bazz

Shaun · April 22, 2010, 11:05pm

I got an error message;

#1071 - Specified key was too long; max key length is 1000 bytes

_

r937 · April 23, 2010, 12:13am

could you do a SHOW CREATE TABLE please

Shaun · April 23, 2010, 11:14am

#1064 - You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ‘SHOW CREATE TABLE’ at line 4

_

SJH · April 23, 2010, 11:56am

You’ll need to follow those three words with the name of the table

Shaun · April 23, 2010, 12:07pm

[FONT=“Georgia”]Ore.

Okay it’s;[/FONT]

CREATE TABLE `yellow_categories` (
 `categories_id` int(5) NOT NULL DEFAULT '0',
 `categories_title` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
 `categories_unique` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
 `categories_country` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
 `categories_letter` varchar(1) COLLATE utf8_unicode_ci NOT NULL,
 `categories_banner_id` int(5) NOT NULL,
 `categories_status` int(1) NOT NULL,
 `categories_log` text COLLATE utf8_unicode_ci NOT NULL,
 `categories_01` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
 `categories_02` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
 `categories_03` varchar(255) COLLATE utf8_unicode_ci NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci

r937 · April 23, 2010, 4:10pm

something doesn’t add up

the ALTER to add a UNIQUE index on (title,country) gave you the error message " max key length is 1000 bytes "

however, your SHOW CREATE TABLE shows them as VARCHAR(255) which should only add up to 510

what gives, dude?

Shaun · April 23, 2010, 6:41pm

[FONT=“Georgia”]hahaha, Man if I knew that, I wouldn’t be here.

[/FONT]

Shaun · April 23, 2010, 6:51pm

[FONT=“Georgia”]I tried approaching the problem from another angle; At data entry rather than at database settings.

I truncated the whole damn table, then tried changing the input code in my PHP form to INSERT IGNORE.

That failed miserably; Instead of importing the thousands of record in my tab-delimited text file, it imported only one, with no data, and with an id (‘categories_id’) of 0.

And actually, I noticed when I scrapped that, removed the IGNORE, and re-imported the data, ALL of the id’s were 0!

It’s actually not a problem, because I’m referencing the records by ‘categories_unique’, but it’s puzzling.

I don’t know if both of those errors are related.

[/FONT]

r937 · April 23, 2010, 7:46pm

if you don’t need an auto_increment column, then for pity’s sake don’t declare one

Shaun · April 23, 2010, 7:55pm

[FONT=“Georgia”]Or else what?

It was being used at first, but the plan changed during building. The ‘categories_id’ column’s only there now because it would be more hassle to remove than keep it in.

[/FONT]

r937 · April 23, 2010, 7:57pm

which tables are related to this yellow categories table? (and why is it yellow?)

Shaun · April 25, 2010, 11:30pm

[FONT=“Georgia”]I’m still getting that “#1071 - Specified key was too long; max key length is 1000 bytes” error.

I made a file to clean redundant data after input, so I’ll just manually run that after data uploads (or redirect the input form to it).

In my next big database project I’ll spend sometime figuring out the ADD UNIQUE tools before beginning anything. This project isn’t paid well enough to fret over it now.

Thanks for all the help Rudy and everyone!

[/FONT]

Topic		Replies	Views
Delete Duplicate records in a table Databases	8	814	May 21, 2010
Selecting Just One of Duplicates Databases	2	372	September 12, 2010
Dublicate posts, how to remove 1 and leave one? Databases	2	494	March 6, 2011
Duplicate entry Databases	3	1023	September 14, 2015
Duplicate record Databases	5	555	March 17, 2010

Eliminating Duplicate Records

Related topics