Duplicate Issue

Im not exactly sure how, but I’ve ended up with duplicates.

In the current table, you’ll see that I used chapt_id as the PK. This allowed for duplicate chapters in some movies.

CREATE TABLE IF NOT EXISTS `chapters` (
  `chapt_id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `mov_id` bigint(20) unsigned NOT NULL DEFAULT '0',
  `chapter_num` varchar(3) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',
  `chapter_title` varchar(255) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',
  PRIMARY KEY (`chapt_id`),
  KEY `mov_id` (`mov_id`)
) ENGINE=MyISAM  DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci ;

You can see here, if I export from the chapters table

INSERT INTO `chapters` VALUES
(1996, 65, '1', 'Main Titles'),
(1997, 65, '2', 'The Prisoners Arrive'),
(1998, 65, '3', 'Escape Attempt 1'),
(1999, 65, '4', 'The Blind Spot'),
(2000, 65, '5', 'To the Cooler'),
(2001, 65, '6', 'Bartlett''s Arrival'),
(2002, 65, '7', 'The Objective'),
(2003, 65, '8', 'The Escape Committee'),
(2004, 65, '9', 'Forger & Scrounger'),
(2005, 65, '10', 'Stealing Steel'),
(322864, 65, '1', 'Main Titles'),
(322865, 65, '2', 'The Prisoners Arrive'),
(322866, 65, '3', 'Escape Attempt 1'),
(322867, 65, '4', 'The Blind Spot'),
(322868, 65, '5', 'To the Cooler'),
(322869, 65, '6', 'Bartlett''s Arrival'),
(322870, 65, '7', 'The Objective'),
(322871, 65, '8', 'The Escape Committee'),
(322872, 65, '9', 'Forger & Scrounger'),
(322873, 65, '10', 'Stealing Steel');

I’m trying to take the data from the current “chapters” table and INSERT it into the new “movies_chapters” table. Obviously, looking at the PKs below you can see that I’m running into duplicate errors.

CREATE TABLE IF NOT EXISTS `movies_chapters` (
  `mc_mov_id` int(10) unsigned NOT NULL DEFAULT '0',
  `mc_lang` char(2) CHARACTER SET latin1 NOT NULL,
  `mc_chapter_num` varchar(3) CHARACTER SET latin1 NOT NULL,
  `mc_chapter_title` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
  PRIMARY KEY (`mc_mov_id`,`mc_chapter_num`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

I’ll be using the mov_id and the chapter_num as the new PK, so I’m not worried about keeping the chapt_id anymore. What’s the best way to handle this? Delete the duplicates from the chapters table? If so, how?

Thanks!

Do any other tables in the database use the PK from the chapters table is their FK?

Nope.

see if you can spot the difference in these two solutions –

INSERT 
  INTO movies_chapters
     ( mc_mov_id
     , mc_lang
     , mc_chapter_num
     , mc_chapter_title )
SELECT DISTINCT
       mov_id
     , 'en'
     , chapter_num
     , chapter_title  
  FROM chapters
INSERT IGNORE
  INTO movies_chapters
     ( mc_mov_id
     , mc_lang
     , mc_chapter_num
     , mc_chapter_title )
SELECT mov_id
     , 'en'
     , chapter_num
     , chapter_title  
  FROM chapters

The syntax is different, but I’m going to say those do the same thing. Does INSERT INGORE use less system resources?

no, INSERT IGNORE attempts to insert all rows, ignoring dupes, whereas the INSERT which has SELECT DISTINCT inserts only the uniques

system resources is of minor concern here, since you are presumably doing this only once

@Rudy, for the benefit of anyone who comes across this thread what performance difference, if any is there between the two solutions?

i would guess that the one with DISTINCT might be slower, because it has to sort the entire table on all columns first, whereas the IGNORE only has to do an index lookup on each row

who do you suppose has some sample data in the requisite two table layouts and might be available to test it for us?

I tried doing both and received this error performing the DUPLICATE query: #2006 - MySQL server has gone away. The chapters table has over 600,000 rows.

Rudy, I understand that the actions are different, but isn’t the outcome still the same? Which method do you recommend?

i would not agonize over which method to use

if one of them failed on me, i’d use the other one in a flash

Excellent, so they DO produce the same result.

I’m working on this on my local Mac, not a web server. It’s a Mac Pro 8 Core with 6 GB RAM. My query results are

Using INSERT IGNORE
Inserted rows: 579653 (Query took 47.4497 sec)

Using DUPLICATE
#2006 - MySQL server has gone away.

I tried this three times and got the same result. Use INSERT IGNORE. :slight_smile:

Thanks Rudy!

Not necessarily. You might have multiple records with the same mov_id and chapter_num, but have different titles. Distinct sees those as distinct rows. An error would be thrown when you try to insert them because your new table only has a unique constraint on mov_id and chapter_num.

This probably isn’t the case in your dataset though.