SitePoint Sponsor

User Tag List

Results 1 to 18 of 18
  1. #1
    SitePoint Evangelist cms9651's Avatar
    Join Date
    Mar 2010
    Posts
    407
    Mentioned
    1 Post(s)
    Tagged
    0 Thread(s)

    [MySQL 5.0.45-community-nt-log] Remove Duplicate Data or Rows

    Hello guys!

    How do I remove duplicate data or rows from a MySQL query in the table with more than 10.000.000 rows?

    Could you please help?
    Thanks you very much for your help.
    Thanks you very much for your help
    I'm really happy for your quickly answer.
    Good bye

  2. #2
    SQL Consultant gold trophysilver trophybronze trophy
    r937's Avatar
    Join Date
    Jul 2002
    Location
    Toronto, Canada
    Posts
    39,014
    Mentioned
    53 Post(s)
    Tagged
    2 Thread(s)
    how? with a healthy dose of patience

    what part are you having problems with? identifying the duplicates, or removing them?
    r937.com | rudy.ca | Buy my SitePoint book: Simply SQL
    "giving out my real stuffs"

  3. #3
    SitePoint Evangelist cms9651's Avatar
    Join Date
    Mar 2010
    Posts
    407
    Mentioned
    1 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by r937 View Post
    how? with a healthy dose of patience
    You have right my teacher...

    Quote Originally Posted by r937 View Post
    what part are you having problems with? identifying the duplicates, or removing them?
    I have first problem with identifying the duplicates because I use DISTINCT in my Select query but for the number of rows (more than 10.000.000 rows) I have this error:
    [Err] 2013 - Lost connection to MySQL server during query
    Any suggestions?
    Thank you

  4. #4
    SQL Consultant gold trophysilver trophybronze trophy
    r937's Avatar
    Join Date
    Jul 2002
    Location
    Toronto, Canada
    Posts
    39,014
    Mentioned
    53 Post(s)
    Tagged
    2 Thread(s)
    how do you know there are any duplicates at all?
    r937.com | rudy.ca | Buy my SitePoint book: Simply SQL
    "giving out my real stuffs"

  5. #5
    SitePoint Evangelist cms9651's Avatar
    Join Date
    Mar 2010
    Posts
    407
    Mentioned
    1 Post(s)
    Tagged
    0 Thread(s)
    thank you for reply.

    I try this query:
    Code:
    SELECT
    	Field1,
    	Field2,
    	Field3 ,......Field16
    FROM
    	tbl_t
    GROUP BY
    	Field1,
    	Field2,
    	Field3 ,......Field16
    HAVING
    	COUNT(*) > 1;
    And in the output I have duplicates rows...
    Thanks you very much for your help
    I'm really happy for your quickly answer.
    Good bye

  6. #6
    SQL Consultant gold trophysilver trophybronze trophy
    r937's Avatar
    Join Date
    Jul 2002
    Location
    Toronto, Canada
    Posts
    39,014
    Mentioned
    53 Post(s)
    Tagged
    2 Thread(s)
    so you don't actually have a problem identifying the duplicates?

    how many collumns in this table?
    r937.com | rudy.ca | Buy my SitePoint book: Simply SQL
    "giving out my real stuffs"

  7. #7
    SitePoint Evangelist cms9651's Avatar
    Join Date
    Mar 2010
    Posts
    407
    Mentioned
    1 Post(s)
    Tagged
    0 Thread(s)
    thank you.

    My problems are:
    • the number of rows in tbl_t over 10.000.000;
    • delete the duplicate rows in tbl_t;
    • the number of column in tbl_t are 16.

    If try this example query I don't have problem:
    Code:
    SELECT
    	Field1
    FROM
    	tbl_t
    GROUP BY
    	Field1
    HAVING
    	COUNT(*) > 1;
    The problem is when add all fields (number 16) of tbl_t in my query for find duplicates:
    Code:
    SELECT
    	Field1,
    	Field2,
    	Field3 ,......Field16
    FROM
    	tbl_t
    GROUP BY
    	Field1,
    	Field2,
    	Field3 ,......Field16
    HAVING
    	COUNT(*) > 1;
    MySQL is crashing, what do I do?
    Thanks you very much for your help
    I'm really happy for your quickly answer.
    Good bye

  8. #8
    SQL Consultant gold trophysilver trophybronze trophy
    r937's Avatar
    Join Date
    Jul 2002
    Location
    Toronto, Canada
    Posts
    39,014
    Mentioned
    53 Post(s)
    Tagged
    2 Thread(s)
    what is the primary key of your table?
    r937.com | rudy.ca | Buy my SitePoint book: Simply SQL
    "giving out my real stuffs"

  9. #9
    SitePoint Evangelist cms9651's Avatar
    Join Date
    Mar 2010
    Posts
    407
    Mentioned
    1 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by r937 View Post
    what is the primary key of your table?
    Field `1_ID` int(10) NOT NULL AUTO_INCREMENT, PRIMARY KEY (`1_ID`)
    Thanks you very much for your help
    I'm really happy for your quickly answer.
    Good bye

  10. #10
    SQL Consultant gold trophysilver trophybronze trophy
    r937's Avatar
    Join Date
    Jul 2002
    Location
    Toronto, Canada
    Posts
    39,014
    Mentioned
    53 Post(s)
    Tagged
    2 Thread(s)
    i am sorry, something still is not right

    you cannot possibly have duplicate rows in the table, not with a PRIMARY KEY

    this is, i am sure, because you have used fake column names

    something doesn't make sense here
    r937.com | rudy.ca | Buy my SitePoint book: Simply SQL
    "giving out my real stuffs"

  11. #11
    SitePoint Evangelist cms9651's Avatar
    Join Date
    Mar 2010
    Posts
    407
    Mentioned
    1 Post(s)
    Tagged
    0 Thread(s)
    It's true... I have used fake column names in my example to simplify:
    Code:
    SELECT
    	Field1,
    	Field2,
    	Field3 ,......Field16
    FROM
    	tbl_t
    GROUP BY
    	Field1,
    	Field2,
    	Field3 ,......Field16
    HAVING
    	COUNT(*) > 1;
    For me these are duplicate rows even if `1_id` is different, all other data are the same:
    Code:
    +---------+-------------+-------------+----------------+----------+-----------+---------------+--------------+--------------+-----+--------+
    | 1_id    |name 	| asciiname   | alternatenames | latitude | longitude | feature class | feature code | country code | cc2 | admin1 |
    +---------+-------------+-------------+----------------+----------+-----------+---------------+--------------+--------------+-----+--------+
    | 3038838 | Costa Verda | Costa Verda | cv             | 42.48333 | 1.66667   | T             | SLP          | AD           |   1 | 00     |
    +---------+-------------+-------------+----------------+----------+-----------+---------------+--------------+--------------+-----+--------+
    | 3038839 | Costa Verda | Costa Verda | cv             | 42.48333 | 1.66667   | T             | SLP          | AD           |   1 | 00     |
    +---------+-------------+-------------+----------------+----------+-----------+---------------+--------------+--------------+-----+--------+
    
    continue...
    +---------+------------+-------------+----------------+-------------------+
    | admin2  | population | elevation   | timezone       | modification date |
    +---------+------------+-------------+----------------+-------------------+
    | 02      | 8839       | 369         | Costa Verda    | 2010-12-22        |
    +---------+------------+-------------+----------------+-------------------+
    | 02      | 8839       | 369         | Costa Verda    | 2010-12-22        |
    +---------+------------+-------------+----------------+-------------------+
    Table structure:
    Code:
    DROP TABLE IF EXISTS `tbl_t`;
    CREATE TABLE `tbl_t` (
      `1_id` int(10) NOT NULL AUTO_INCREMENT,
      `name` varchar(200) DEFAULT NULL,
      `asciiname` varchar(200) DEFAULT NULL,
      `alternatenames` varchar(5000) DEFAULT NULL,
      `latitude` char(200) DEFAULT NULL,
      `longitude` char(200) DEFAULT NULL,
      `feature class` char(100) DEFAULT NULL,
      `feature code` char(100) DEFAULT NULL,
      `country code` char(200) DEFAULT NULL,
      `cc2` char(60) DEFAULT NULL,
      `admin1` char(20) DEFAULT NULL,
      `admin2` char(80) DEFAULT NULL,
      `population` varchar(200) DEFAULT NULL,
      `elevation` varchar(200) DEFAULT NULL,
      `timezone` varchar(200) DEFAULT NULL,
      `modification date` date DEFAULT NULL,
      PRIMARY KEY (`1_id`),
      KEY `CODE` (`1_id`),
      KEY `COUNTRYCODE` (`country code`),
      KEY `asciiname` (`asciiname`)
    ) ENGINE=MyISAM AUTO_INCREMENT=7778661 DEFAULT CHARSET=latin1;
    Thanks you very much for your help
    I'm really happy for your quickly answer.
    Good bye

  12. #12
    Always A Novice bronze trophy
    K. Wolfe's Avatar
    Join Date
    Nov 2003
    Location
    Columbus, OH
    Posts
    2,075
    Mentioned
    53 Post(s)
    Tagged
    0 Thread(s)
    Simply select all columns and group by all columns, and place a min() on the primary key ID, and remove that ID from the group clause. This will give you a unique list of data and the lowest ID.

  13. #13
    SQL Consultant gold trophysilver trophybronze trophy
    r937's Avatar
    Join Date
    Jul 2002
    Location
    Toronto, Canada
    Posts
    39,014
    Mentioned
    53 Post(s)
    Tagged
    2 Thread(s)
    kyle, he's got unique primary keys, putting a MIN() on the primary key won't help

    the duplicates now are given by "even if `1_id` is different, all other data are the same"

    add a WHERE clause to restrict the duplicates to a range --
    Code:
    SELECT
    	id_1,
    	Field2,
    	Field3 ,......Field16
    FROM
    	tbl_t
     WHERE id_1 BETWEEN 13000 AND 17000
    GROUP BY
    	id_1,
    	Field2,
    	Field3 ,......Field16
    HAVING
    	COUNT(*) > 1;
    this should stop the query from "crashing" but you'll have to do it multiple times
    r937.com | rudy.ca | Buy my SitePoint book: Simply SQL
    "giving out my real stuffs"

  14. #14
    SitePoint Evangelist cms9651's Avatar
    Join Date
    Mar 2010
    Posts
    407
    Mentioned
    1 Post(s)
    Tagged
    0 Thread(s)
    thank you.

    I try this query:
    Code:
    SELECT
    	id_1,
    	Field2,
    	Field3 ,......Field16
    FROM
    	tbl_t
     WHERE id_1 BETWEEN 1 AND 10000
    GROUP BY
    	id_1,
    	Field2,
    	Field3 ,......Field16
    HAVING
    	COUNT(*) > 1;
    The output is:
    Code:
    Affected rows: 0
    Time: 1.176ms
    this means that between the rows 1 and 10000 there are not duplicates ?
    Thanks you very much for your help
    I'm really happy for your quickly answer.
    Good bye

  15. #15
    Always A Novice bronze trophy
    K. Wolfe's Avatar
    Join Date
    Nov 2003
    Location
    Columbus, OH
    Posts
    2,075
    Mentioned
    53 Post(s)
    Tagged
    0 Thread(s)
    CMS. Do you have other tables that use id_1 as a foreign key?

  16. #16
    SitePoint Evangelist cms9651's Avatar
    Join Date
    Mar 2010
    Posts
    407
    Mentioned
    1 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by K. Wolfe View Post
    CMS. Do you have other tables that use id_1 as a foreign key?
    Not, I don't have.
    Thanks you very much for your help
    I'm really happy for your quickly answer.
    Good bye

  17. #17
    Always A Novice bronze trophy
    K. Wolfe's Avatar
    Join Date
    Nov 2003
    Location
    Columbus, OH
    Posts
    2,075
    Mentioned
    53 Post(s)
    Tagged
    0 Thread(s)
    My earlier post will suffice then, at least as a starting point. You don't even need to worry about bringing your id_1 over and just create a new one.

    Code MySQL:
    --Make temp table
    CREATE TABLE `tempTable` (
      `1_id` int(10) NOT NULL AUTO_INCREMENT,
      `name` varchar(200) DEFAULT NULL,
      `asciiname` varchar(200) DEFAULT NULL,
      `alternatenames` varchar(5000) DEFAULT NULL,
      `latitude` char(200) DEFAULT NULL,
      `longitude` char(200) DEFAULT NULL,
      `feature class` char(100) DEFAULT NULL,
      `feature code` char(100) DEFAULT NULL,
      `country code` char(200) DEFAULT NULL,
      `cc2` char(60) DEFAULT NULL,
      `admin1` char(20) DEFAULT NULL,
      `admin2` char(80) DEFAULT NULL,
      `population` varchar(200) DEFAULT NULL,
      `elevation` varchar(200) DEFAULT NULL,
      `timezone` varchar(200) DEFAULT NULL,
      `modification date` date DEFAULT NULL,
      PRIMARY KEY (`1_id`),
      KEY `CODE` (`1_id`),
      KEY `COUNTRYCODE` (`country code`),
      KEY `asciiname` (`asciiname`)
    ) ENGINE=MyISAM AUTO_INCREMENT=7778661 DEFAULT CHARSET=latin1;
    --Append unique data to new table, excluding 1_id from select as it is causing your 'dupes'
    INSERT INTO tempTable (name, asciiname, alternatenames, latitude, longitude, feature class, feature code, country code, cc2, admin1, admin2, population, elevation, timezone, modification date)
    SELECT
      name,
      asciiname,
      alternatenames,
      latitude,
      longitude,
      feature class,
      feature code,
      country code,
      cc2,
      admin1,
      admin2,
      population,
      elevation,
      timezone,
      modification date
    FROM
      tbl_t
    GROUP BY  
      name,
      asciiname,
      alternatenames,
      latitude,
      longitude,
      feature class,
      feature code,
      country code,
      cc2,
      admin1,
      admin2,
      population,
      elevation,
      timezone,
      modification date

    This is just a start.. You obviously have an insert query somewhere that was causing duplicates to be created, you'll have to find that and weed it out. Once you've created this temp table you may want to check for duplicate 'name' entries.

  18. #18
    SitePoint Evangelist cms9651's Avatar
    Join Date
    Mar 2010
    Posts
    407
    Mentioned
    1 Post(s)
    Tagged
    0 Thread(s)
    thank you.
    I try your suggestion but I have the crash of mysql:
    [Err] 126 - Incorrect key file for table 'C:\WINDOWS\TEMP\#sql240_1e_9.MYI'; try to repair it
    Thanks you very much for your help
    I'm really happy for your quickly answer.
    Good bye


Bookmarks

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •