Hello guys!
How do I remove duplicate data or rows from a MySQL query in the table with more than 10.000.000 rows?
Could you please help?
Thanks you very much for your help.
| SitePoint Sponsor |
Hello guys!
How do I remove duplicate data or rows from a MySQL query in the table with more than 10.000.000 rows?
Could you please help?
Thanks you very much for your help.
Thanks you very much for your help
I'm really happy for your quickly answer.
Good bye


how? with a healthy dose of patience
what part are you having problems with? identifying the duplicates, or removing them?
You have right my teacher...
I have first problem with identifying the duplicates because I use DISTINCT in my Select query but for the number of rows (more than 10.000.000 rows) I have this error:
Any suggestions?[Err] 2013 - Lost connection to MySQL server during query
Thank you


how do you know there are any duplicates at all?
thank you for reply.
I try this query:
And in the output I have duplicates rows...Code:SELECT Field1, Field2, Field3 ,......Field16 FROM tbl_t GROUP BY Field1, Field2, Field3 ,......Field16 HAVING COUNT(*) > 1;
Thanks you very much for your help
I'm really happy for your quickly answer.
Good bye


so you don't actually have a problem identifying the duplicates?
how many collumns in this table?
thank you.
My problems are:
- the number of rows in tbl_t over 10.000.000;
- delete the duplicate rows in tbl_t;
- the number of column in tbl_t are 16.
If try this example query I don't have problem:
The problem is when add all fields (number 16) of tbl_t in my query for find duplicates:Code:SELECT Field1 FROM tbl_t GROUP BY Field1 HAVING COUNT(*) > 1;
MySQL is crashing, what do I do?Code:SELECT Field1, Field2, Field3 ,......Field16 FROM tbl_t GROUP BY Field1, Field2, Field3 ,......Field16 HAVING COUNT(*) > 1;
Thanks you very much for your help
I'm really happy for your quickly answer.
Good bye


what is the primary key of your table?


i am sorry, something still is not right
you cannot possibly have duplicate rows in the table, not with a PRIMARY KEY
this is, i am sure, because you have used fake column names
something doesn't make sense here
It's true... I have used fake column names in my example to simplify:
For me these are duplicate rows even if `1_id` is different, all other data are the same:Code:SELECT Field1, Field2, Field3 ,......Field16 FROM tbl_t GROUP BY Field1, Field2, Field3 ,......Field16 HAVING COUNT(*) > 1;
Table structure:Code:+---------+-------------+-------------+----------------+----------+-----------+---------------+--------------+--------------+-----+--------+ | 1_id |name | asciiname | alternatenames | latitude | longitude | feature class | feature code | country code | cc2 | admin1 | +---------+-------------+-------------+----------------+----------+-----------+---------------+--------------+--------------+-----+--------+ | 3038838 | Costa Verda | Costa Verda | cv | 42.48333 | 1.66667 | T | SLP | AD | 1 | 00 | +---------+-------------+-------------+----------------+----------+-----------+---------------+--------------+--------------+-----+--------+ | 3038839 | Costa Verda | Costa Verda | cv | 42.48333 | 1.66667 | T | SLP | AD | 1 | 00 | +---------+-------------+-------------+----------------+----------+-----------+---------------+--------------+--------------+-----+--------+ continue... +---------+------------+-------------+----------------+-------------------+ | admin2 | population | elevation | timezone | modification date | +---------+------------+-------------+----------------+-------------------+ | 02 | 8839 | 369 | Costa Verda | 2010-12-22 | +---------+------------+-------------+----------------+-------------------+ | 02 | 8839 | 369 | Costa Verda | 2010-12-22 | +---------+------------+-------------+----------------+-------------------+
Code:DROP TABLE IF EXISTS `tbl_t`; CREATE TABLE `tbl_t` ( `1_id` int(10) NOT NULL AUTO_INCREMENT, `name` varchar(200) DEFAULT NULL, `asciiname` varchar(200) DEFAULT NULL, `alternatenames` varchar(5000) DEFAULT NULL, `latitude` char(200) DEFAULT NULL, `longitude` char(200) DEFAULT NULL, `feature class` char(100) DEFAULT NULL, `feature code` char(100) DEFAULT NULL, `country code` char(200) DEFAULT NULL, `cc2` char(60) DEFAULT NULL, `admin1` char(20) DEFAULT NULL, `admin2` char(80) DEFAULT NULL, `population` varchar(200) DEFAULT NULL, `elevation` varchar(200) DEFAULT NULL, `timezone` varchar(200) DEFAULT NULL, `modification date` date DEFAULT NULL, PRIMARY KEY (`1_id`), KEY `CODE` (`1_id`), KEY `COUNTRYCODE` (`country code`), KEY `asciiname` (`asciiname`) ) ENGINE=MyISAM AUTO_INCREMENT=7778661 DEFAULT CHARSET=latin1;
Thanks you very much for your help
I'm really happy for your quickly answer.
Good bye

Simply select all columns and group by all columns, and place a min() on the primary key ID, and remove that ID from the group clause. This will give you a unique list of data and the lowest ID.
<?php//Kyle Wolfeecho devBlog("My Dev Notes");


kyle, he's got unique primary keys, putting a MIN() on the primary key won't help
the duplicates now are given by "even if `1_id` is different, all other data are the same"
add a WHERE clause to restrict the duplicates to a range --this should stop the query from "crashing" but you'll have to do it multiple timesCode:SELECT id_1, Field2, Field3 ,......Field16 FROM tbl_t WHERE id_1 BETWEEN 13000 AND 17000 GROUP BY id_1, Field2, Field3 ,......Field16 HAVING COUNT(*) > 1;
thank you.
I try this query:The output is:Code:SELECT id_1, Field2, Field3 ,......Field16 FROM tbl_t WHERE id_1 BETWEEN 1 AND 10000 GROUP BY id_1, Field2, Field3 ,......Field16 HAVING COUNT(*) > 1;
this means that between the rows 1 and 10000 there are not duplicates ?Code:Affected rows: 0 Time: 1.176ms![]()
Thanks you very much for your help
I'm really happy for your quickly answer.
Good bye

CMS. Do you have other tables that use id_1 as a foreign key?
<?php//Kyle Wolfeecho devBlog("My Dev Notes");

My earlier post will suffice then, at least as a starting point. You don't even need to worry about bringing your id_1 over and just create a new one.
Code MySQL:--Make temp table CREATE TABLE `tempTable` ( `1_id` int(10) NOT NULL AUTO_INCREMENT, `name` varchar(200) DEFAULT NULL, `asciiname` varchar(200) DEFAULT NULL, `alternatenames` varchar(5000) DEFAULT NULL, `latitude` char(200) DEFAULT NULL, `longitude` char(200) DEFAULT NULL, `feature class` char(100) DEFAULT NULL, `feature code` char(100) DEFAULT NULL, `country code` char(200) DEFAULT NULL, `cc2` char(60) DEFAULT NULL, `admin1` char(20) DEFAULT NULL, `admin2` char(80) DEFAULT NULL, `population` varchar(200) DEFAULT NULL, `elevation` varchar(200) DEFAULT NULL, `timezone` varchar(200) DEFAULT NULL, `modification date` date DEFAULT NULL, PRIMARY KEY (`1_id`), KEY `CODE` (`1_id`), KEY `COUNTRYCODE` (`country code`), KEY `asciiname` (`asciiname`) ) ENGINE=MyISAM AUTO_INCREMENT=7778661 DEFAULT CHARSET=latin1; --Append unique data to new table, excluding 1_id from select as it is causing your 'dupes' INSERT INTO tempTable (name, asciiname, alternatenames, latitude, longitude, feature class, feature code, country code, cc2, admin1, admin2, population, elevation, timezone, modification date) SELECT name, asciiname, alternatenames, latitude, longitude, feature class, feature code, country code, cc2, admin1, admin2, population, elevation, timezone, modification date FROM tbl_t GROUP BY name, asciiname, alternatenames, latitude, longitude, feature class, feature code, country code, cc2, admin1, admin2, population, elevation, timezone, modification date
This is just a start.. You obviously have an insert query somewhere that was causing duplicates to be created, you'll have to find that and weed it out. Once you've created this temp table you may want to check for duplicate 'name' entries.
<?php//Kyle Wolfeecho devBlog("My Dev Notes");
thank you.
I try your suggestion but I have the crash of mysql:
[Err] 126 - Incorrect key file for table 'C:\WINDOWS\TEMP\#sql240_1e_9.MYI'; try to repair it
Thanks you very much for your help
I'm really happy for your quickly answer.
Good bye
Bookmarks