Control selection order to dedupe then insert

nichemtktg · September 27, 2015, 9:50pm

Here’s my code:
INSERT final_deduped2
SELECT * FROM (
SELECT *
FROM final_deduped
ORDER BY id DESC
) AS tmp_table GROUP BY address

I wrote it to insert the address with the highest id (no dupes) into the table called final_deduped2.
Also, I’m under the impression that ORDER BY and GROUP BY can’t be used in the same SELECT.

I’ve checked my results manually and believe my script did what I need it to do, but there are now 70K+ rows in final_deduped2.

Can you see bad practices or any reason to think that my satisfaction is just a coincidence?

r937 · September 28, 2015, 9:00am

any time you use the dreaded evil “select star” along with GROUP BY, you’re asking for heartache

you can definitely use GROUP BY and ORDER BY in the same query, but only when the ORDER BY is used to sort the results of the groupings – using ORDER BY “before” grouping like you’ve tried to do doesn’t work

if you want the highest id, then you’ll have to use MAX(id) in your grouping

nichemtktg · September 29, 2015, 11:04am

Then, what’s the difference between ORDER BY id DESC and ORDER BY MAX(id)?

r937 · September 29, 2015, 11:29am

INSERT INTO final_deduped2 SELECT t.* FROM ( SELECT address , MAX(id) AS max_id FROM final_deduped GROUP BY address ) AS addr INNER JOIN final_deduped AS t ON t.address = addr.address AND t.id = addr.max_id
run the inner query by itself to see and confirm what it produces

then run the outer SELECT by itself to see the effect of joining the original table to the subquery results (also called a derived table) based on the two join conditions

then stick your INSERT in front when you’re ready

nichemtktg · September 29, 2015, 12:29pm

The inner query, outer select, and insert produce the same results based on this data set.
Is that the point? Or, is there somekind of randomness that can occur in large data sets?

id address
1 MAIN ST
2 OAK ST
3 OAK ST
4 SOUTH ST
5 SOUTH ST
6 SOUTH ST
7 NORTH ST
8 NORTH ST
9 NORTH ST
10 NORTH ST

r937 · September 29, 2015, 1:41pm

not true

easily disproven with this sample test script –

[code]CREATE TABLE nichemtktg
( id INTEGER
, address VARCHAR(9)
, otherdata CHAR(2)
);
INSERT INTO nichemtktg VALUES
( 1 , ‘MAIN ST’ , ‘QP’ )
,( 2 , ‘OAK ST’ , ‘WO’ )
,( 3 , ‘OAK ST’ , ‘EI’ )
,( 4 , ‘SOUTH ST’ , ‘RU’ )
,( 5 , ‘SOUTH ST’ , ‘TY’ )
,( 6 , ‘SOUTH ST’ , ‘YT’ )
,( 7 , ‘NORTH ST’ , ‘UR’ )
,( 8 , ‘NORTH ST’ , ‘IE’ )
,( 9 , ‘NORTH ST’ , ‘OW’ )
,( 10 , ‘NORTH ST’ , ‘PQ’ )
;

SELECT address
, MAX(id) AS max_id
FROM nichemtktg
GROUP
BY address

/*
address max_id
MAIN ST 1
NORTH ST 10
OAK ST 3
SOUTH ST 6
*/

SELECT t.*
FROM ( SELECT address
, MAX(id) AS max_id
FROM nichemtktg
GROUP
BY address ) AS addr
INNER
JOIN nichemtktg AS t
ON t.address = addr.address
AND t.id = addr.max_id

/*
id address otherdata
1 MAIN ST QP
3 OAK ST EI
6 SOUTH ST YT
10 NORTH ST PQ
*/
[/code]
so your situation must have some other complexity which is not readily apparent

nichemtktg · September 29, 2015, 1:53pm

So, the point is that the order of insertion is different. Correct?

r937 · September 29, 2015, 4:03pm

no, the order of insertion doesn’t matter in the slightest

the point is, my query results in only 4 rows being inserted, not 10 as you said you got with that particular sample

nichemtktg · September 29, 2015, 5:36pm

Sorry for the misunderstanding. I meant all the the results were the same in all three scripts:
1 MAIN ST
10 NORTH ST
3 OAK ST
6 SOUTH ST
I should’ve specified the results. I only specified the dataset.

That’s also the same results from applying the script from the OP to the dataset.

Were you expecting different results?

r937 · September 29, 2015, 6:11pm

did you have any other columns in the fnal_deduped table besides id and address?

if so, my approach is guaranteed to give you the correct columns that belong to the row with the max id, while yours is indeterminate (i.e wouldn’t necessarily match ‘PQ’ with 10 for NORTH ST)

nichemtktg · September 29, 2015, 7:02pm

I got that from your posts.

Is part of the lesson about aggregated columns?

https://dev.mysql.com/doc/refman/5.0/en/group-by-handling.html

r937 · September 29, 2015, 10:08pm

yes!!

if you’ve read through the mysql explanation, read this one too – http://rpbouman.blogspot.ca/2007/05/debunking-group-by-myths.html

system · December 30, 2015, 5:08am

This topic was automatically closed 91 days after the last reply. New replies are no longer allowed.