Help with VIEW that locates duplicates in one column - SQLServer

LesMizzell · November 16, 2014, 5:27am

I’m trying to generate a list of " nl_mailgroups.ml_id" that contains duplicate email addresses ( nl_mailgroups.ml_email) that belong to a specific group ( nl_catREL.groups_id).

Got no problem getting the view to return all records in the group - but getting it to return JUST the duplicates is giving me a headache. I’ve Googled like a demon, and tried a ton of stuff, but can’t seem to get it to work just right. Ack.

Current view below. What do I have wrong?

SELECT
nl_mailgroups.ml_id,
nl_mailgroups.ml_email,
nl_catREL.groups_id
FROM nl_catREL
INNER JOIN
nl_mailgroups ON nl_catREL.ml_id = nl_mailgroups.ml_id
WHERE
(nl_catREL.groups_id = 1100)
AND EXISTS
(SELECT nl_mailgroups.ml_email, COUNT() AS RecordCnt
FROM nl_mailgroups AS nl_mailgroups_1
GROUP BY ml_email
HAVING (COUNT() > 1))

r937 · November 16, 2014, 10:16am

SELECT nl_catREL.groups_id , nl_mailgroups.ml_id , nl_mailgroups.ml_email FROM nl_catREL INNER JOIN nl_mailgroups ON nl_mailgroups.ml_id = nl_catREL.ml_id WHERE nl_catREL.groups_id = 1100 GROUP BY nl_catREL.groups_id , nl_mailgroups.ml_id , nl_mailgroups.ml_email HAVING COUNT(*) > 1

LesMizzell · November 16, 2014, 2:15pm

r937 - returns no results. If nl_mailgroups.ml_id is included in the GROUP BY, isn’t it looking for duplicate values there? Those are all unique with no duplicates.

Given the data below, it should return ml_id 26 and 29

groups_id    ml_id    ml_email
1                23        email-10@somewhere.com
1                24        email-12@somewhere.com
1                25        email-14@somewhere.com
1                26        email-10@somewhere.com
1                27        email-16@somewhere.com
1                28        email-19@somewhere.com
1                29        email-10@somewhere.com

If I remove ml_id from the group by, I get the dreaded “not included in an aggregate” error…
If I remove the ml_id complete from the entire query, it works:

SELECT     nl_catREL.groups_id, nl_mailgroups.ml_email
FROM        nl_catREL INNER JOIN
                   nl_mailgroups ON nl_mailgroups.ml_id = nl_catREL.ml_id
WHERE     (nl_catREL.groups_id = 1100)
GROUP BY nl_catREL.groups_id, nl_mailgroups.ml_email
HAVING      (COUNT(*) > 1)

…but ml_id is the one value I need returned - and that’s what’s been giving me the headache to start with!

Thanks …

r937 · November 16, 2014, 2:36pm

oops, try this –

SELECT nl_catREL.groups_id , nl_mailgroups.ml_id FROM nl_catREL INNER JOIN nl_mailgroups ON nl_mailgroups.ml_id = nl_catREL.ml_id WHERE nl_catREL.groups_id = 1100 GROUP BY nl_catREL.groups_id , nl_mailgroups.ml_id HAVING COUNT(nl_mailgroups.ml_email) > 1

LesMizzell · November 16, 2014, 3:04pm

Still no results returned. I don’t see how to do this without a sub query myself.
The below will return ALL duplicates in the nl_mailgroups:

SELECT ml_email, COUNT(1) AS DuplicateCount
FROM [wingard-sa].nl_mailgroups
GROUP BY ml_email
HAVING (COUNT(1) > 1)

Now if I can return the ml_ID while constraining that result to the one groups_id from the nl_catREL table.

Seems like something like the below should work, but it times out, so something isn’t quite correct ( there’s only 150 records in the group in question - so I’m really in trouble when I pick a group with 30,000 records…)

SELECT nl_catREL.groups_id,nl_mailgroups.ml_email,nl_mailgroups.ml_id
FROM nl_catREL 
INNER JOIN  nl_mailgroups ON nl_mailgroups.ml_id =nl_catREL.ml_id
WHERE (nl_catREL.groups_id = 1100)
       // if I stop right here, it will return all the records for the one group
       // does it pretty darned quick too, so the problem is below*
AND EXISTS  ( SELECT ml_email, COUNT(1) AS DuplicateCount
              FROM  nl_mailgroups AS nl_mailgroups_1
              GROUP BY ml_email
              HAVING (COUNT(1) > 1))

r937 · November 16, 2014, 4:52pm

[quote=“LesMizzell, post:5, topic:104167, full:true”]
Still no results returned. [/quote]
well, that means that ~no~ nl_mailgroups.ml_id values have more than one nl_mailgroups.ml_email value

at least not in the table you ran my query on

LesMizzell · November 16, 2014, 5:04pm

If I run the simple query below and return the entire group (which I entered specifically for testing) - there’s 3 of one address and 2 of another.

SELECT nl_catREL.groups_id, nl_mailgroups.ml_email, nl_mailgroups.ml_id
FROM nl_catREL INNER JOIN
     nl_mailgroups ON nl_mailgroups.ml_id = nl_catREL.ml_id
WHERE (nl_catREL.groups_id = 1100)
ORDER BY nl_mailgroups.ml_email

Returns:

...
1100    foo@xox.com    379219
1100    foo@xox.com    379220
1100    foo@xox.com    379221
...
1100    foo@foo.net    379202
1100    foo@foo.net    379203
...

r937 · November 16, 2014, 6:34pm

[quote=“LesMizzell, post:7, topic:104167, full:true”]Returns:
…
1100 foo@xox.com 379219
1100 foo@xox.com 379220
1100 foo@xox.com 379221
…
1100 foo@foo.net 379202
1100 foo@foo.net 379203
…
[/quote]

would you be kind enough to run the following query on the same table

SELECT nl_catREL.groups_id , nl_mailgroups.ml_id , COUNT(nl_mailgroups.ml_email) AS emails FROM nl_catREL INNER JOIN nl_mailgroups ON nl_mailgroups.ml_id = nl_catREL.ml_id WHERE nl_catREL.groups_id = 1100 GROUP BY nl_catREL.groups_id , nl_mailgroups.ml_id ORDER BY emails DESC

LesMizzell · November 16, 2014, 7:09pm

would you be kind enough to run the following query on the same table

Returns all rows for records belonging to groups_id 1100 (161 records) like:

    ...
    1100    379218    1
    1100    379219    1
    1100    379220    1
    1100    379221    1
    1100    379222    1
    1100    379223    1
    1100    379224    1
    1100    379225    1
    1100    379226    1
    ....

r937 · November 16, 2014, 7:58pm

thank you

this proves that one of us misunderstands the requirement

you wanted “a list of nl_mailgroups.ml_id that contains duplicate email addresses ( nl_mailgroups.ml_email)”

i demonstrated that ~no~ nl_mailgroups.ml_id have duplicated emails

do you see where we differ?

LesMizzell · November 16, 2014, 8:52pm

Ahhh … my definition of what I needed was off. My apologies. How about:

Return a list the the unique id numbers (ml_id) of all duplicated email addresses (ml_email) belonging to a specific group id (groups_id).

Which for group 1100, should be ml_id:
379220
379221
379203
( returning only the ml_id for duplicate ml_email values, not the originals - which are ml_id - 379219 and 379202 )

My obvious problem that I’ve apparently not gotten my head around yet - returning multiple records/columns where only ONE column has duplicates but the rest are unique. I was previously doing this in Coldfusion using a loop - until the client tossed 30,000 or so records at it at once. So, I’m trying to let the database do the heavy lifting now. Makes my head hurt.

Thank you for your assistance up to this point …

r937 · November 16, 2014, 9:09pm

SELECT nl_catREL.groups_id , duplicates.ml_id FROM nl_catREL INNER JOIN nl_mailgroups ON nl_mailgroups.ml_id = nl_catREL.ml_id INNER JOIN nl_mailgroups AS duplicates ON duplicates.ml_email = nl_mailgroups.ml_email AND duplicates.ml_id > nl_mailgroups.ml_id WHERE nl_catREL.groups_id = 1100

LesMizzell · November 16, 2014, 9:23pm

That one returns 511 records - everything multiple times, including the ml_id of the original and duplicated ml_email.
Interesting …

r937 · November 16, 2014, 9:34pm

okay, finally i did some testing

(aside: you might consider on your next problem to post a test table)

please put the word DISTINCT after the word SELECT

LesMizzell · November 16, 2014, 9:44pm

411 records returned … and the ml_id numbers just keep increasing, not repeated…

First and last few returned:

1100    379203
1100    379220
1100    379221
...
1100    379834
1100    379835
1100    379836

Definitely

r937 · November 16, 2014, 9:53pm

you said “there’s only 150 records in the group”

any chance you could post your test data now?

LesMizzell · November 16, 2014, 10:43pm

Guess I should have said “approximately” 150 records. I just created a new group with EXACTLY 150 reocrds.

To double check. the below returns 150 records:

SELECT nl_catREL.groups_id,nl_mailgroups.ml_id,nl_mailgroups.ml_email
FROM nl_catREL INNER JOIN
     nl_mailgroups ONnl_catREL.ml_id =nl_mailgroups.ml_id
WHERE (nl_catREL.groups_id = 1104)

Your last query, adjusted for the new group id, returns exactly what it should now, which is making me wonder why it didn’t work correctly before.

    SELECT DISTINCT nl_catREL.groups_id, duplicates.ml_id
    FROM  nl_catREL 
    INNER JOIN
       nl_mailgroups ON nl_mailgroups.ml_id = nl_catREL.ml_id
    INNER JOIN
       nl_mailgroups AS duplicates ON duplicates.ml_email = nl_mailgroups.ml_email 
     AND   duplicates.ml_id > nl_mailgroups.ml_id
     WHERE (nl_catREL.groups_id = 1104)

The duplicates are returned bow. The new group had one additional one, noted in the return as well.

1104    379851    somebody@bellsouth.net
1104    379868    foo@xox.com
1104    379869    foo@xox.com
1104    379987    bob.somebody@chaptertwo.net

Question - why not a sub query using “WHERE EXISTS”? Most of the examples I was finding went in that direction and that’s what I’ve been attempting for way too many hours with awful results.

I have some more testing to do with different groups and such, but I think you’ve got it!
This is part of a larger application that imports records from an Excel sheet written in Coldfusion. It was originally using an older POI utility (well over 2000 lines of code), and was choking when the client tried to import very large Excel sheet with 10,000 or more records. It’s lightening fast compared to what it was.

This finding duplicates part was one of the main bottlenecks.

You’ve no idea how much I appreciate your help.

r937 · November 16, 2014, 11:28pm

i’m glad it’s resolved

r937 · November 16, 2014, 11:30pm

p.s. i think it’s cute how you keep putting those unnecessary parentheses back into the WHERE clause after i take them out

LesMizzell · November 16, 2014, 11:45pm

Actually, that’s the database doing that. I’m pasting the code into SQL Manager, and when I run the query, that’s what it does to it.

The real test of this was one large list that I know is in there with a little over 15,000 records. Doing that one outside the database was just crazy - 6 or 7 minutes to run. This is taking under 10 seconds with the same list.

Thanks!

Topic		Replies	Views
List all duplicate values Databases	4	486	October 7, 2010
Find records which have matching column values Databases	5	520	September 24, 2014
Mysql distinct Databases	8	470	July 30, 2010
Return duplicates Databases	8	641	September 24, 2014
Show me the duplicates Databases	9	733	October 8, 2014

Help with VIEW that locates duplicates in one column - SQLServer

Related topics