I’m trying to generate a list of " nl_mailgroups.ml_id" that contains duplicate email addresses ( nl_mailgroups.ml_email) that belong to a specific group ( nl_catREL.groups_id).
Got no problem getting the view to return all records in the group - but getting it to return JUST the duplicates is giving me a headache. I’ve Googled like a demon, and tried a ton of stuff, but can’t seem to get it to work just right. Ack.
Current view below. What do I have wrong?
SELECT
nl_mailgroups.ml_id,
nl_mailgroups.ml_email,
nl_catREL.groups_id
FROM nl_catREL
INNER JOIN
nl_mailgroups ON nl_catREL.ml_id = nl_mailgroups.ml_id
WHERE
(nl_catREL.groups_id = 1100)
AND EXISTS
(SELECT nl_mailgroups.ml_email, COUNT() AS RecordCnt
FROM nl_mailgroups AS nl_mailgroups_1
GROUP BY ml_email
HAVING (COUNT() > 1))
SELECT nl_catREL.groups_id
, nl_mailgroups.ml_id
, nl_mailgroups.ml_email
FROM nl_catREL
INNER
JOIN nl_mailgroups
ON nl_mailgroups.ml_id = nl_catREL.ml_id
WHERE nl_catREL.groups_id = 1100
GROUP
BY nl_catREL.groups_id
, nl_mailgroups.ml_id
, nl_mailgroups.ml_email
HAVING COUNT(*) > 1
r937 - returns no results. If nl_mailgroups.ml_id is included in the GROUP BY, isn’t it looking for duplicate values there? Those are all unique with no duplicates.
Given the data below, it should return ml_id 26 and 29
If I remove ml_id from the group by, I get the dreaded “not included in an aggregate” error…
If I remove the ml_id complete from the entire query, it works:
SELECT nl_catREL.groups_id, nl_mailgroups.ml_email
FROM nl_catREL INNER JOIN
nl_mailgroups ON nl_mailgroups.ml_id = nl_catREL.ml_id
WHERE (nl_catREL.groups_id = 1100)
GROUP BY nl_catREL.groups_id, nl_mailgroups.ml_email
HAVING (COUNT(*) > 1)
…but ml_id is the one value I need returned - and that’s what’s been giving me the headache to start with!
SELECT nl_catREL.groups_id
, nl_mailgroups.ml_id
FROM nl_catREL
INNER
JOIN nl_mailgroups
ON nl_mailgroups.ml_id = nl_catREL.ml_id
WHERE nl_catREL.groups_id = 1100
GROUP
BY nl_catREL.groups_id
, nl_mailgroups.ml_id
HAVING COUNT(nl_mailgroups.ml_email) > 1
Still no results returned. I don’t see how to do this without a sub query myself.
The below will return ALL duplicates in the nl_mailgroups:
SELECT ml_email, COUNT(1) AS DuplicateCount
FROM [wingard-sa].nl_mailgroups
GROUP BY ml_email
HAVING (COUNT(1) > 1)
Now if I can return the ml_ID while constraining that result to the one groups_id from the nl_catREL table.
Seems like something like the below should work, but it times out, so something isn’t quite correct ( there’s only 150 records in the group in question - so I’m really in trouble when I pick a group with 30,000 records…)
SELECT nl_catREL.groups_id,nl_mailgroups.ml_email,nl_mailgroups.ml_id
FROM nl_catREL
INNER JOIN nl_mailgroups ON nl_mailgroups.ml_id =nl_catREL.ml_id
WHERE (nl_catREL.groups_id = 1100)
// if I stop right here, it will return all the records for the one group
// does it pretty darned quick too, so the problem is below*
AND EXISTS ( SELECT ml_email, COUNT(1) AS DuplicateCount
FROM nl_mailgroups AS nl_mailgroups_1
GROUP BY ml_email
HAVING (COUNT(1) > 1))
[quote=“LesMizzell, post:5, topic:104167, full:true”]
Still no results returned. [/quote]
well, that means that ~no~ nl_mailgroups.ml_id values have more than one nl_mailgroups.ml_email value
If I run the simple query below and return the entire group (which I entered specifically for testing) - there’s 3 of one address and 2 of another.
SELECT nl_catREL.groups_id, nl_mailgroups.ml_email, nl_mailgroups.ml_id
FROM nl_catREL INNER JOIN
nl_mailgroups ON nl_mailgroups.ml_id = nl_catREL.ml_id
WHERE (nl_catREL.groups_id = 1100)
ORDER BY nl_mailgroups.ml_email
would you be kind enough to run the following query on the same table
SELECT nl_catREL.groups_id
, nl_mailgroups.ml_id
, COUNT(nl_mailgroups.ml_email) AS emails
FROM nl_catREL
INNER
JOIN nl_mailgroups
ON nl_mailgroups.ml_id = nl_catREL.ml_id
WHERE nl_catREL.groups_id = 1100
GROUP
BY nl_catREL.groups_id
, nl_mailgroups.ml_id
ORDER
BY emails DESC
Ahhh … my definition of what I needed was off. My apologies. How about:
Return a list the the unique id numbers (ml_id) of all duplicated email addresses (ml_email) belonging to a specific group id (groups_id).
Which for group 1100, should be ml_id:
379220
379221
379203
( returning only the ml_id for duplicate ml_email values, not the originals - which are ml_id - 379219 and 379202 )
My obvious problem that I’ve apparently not gotten my head around yet - returning multiple records/columns where only ONE column has duplicates but the rest are unique. I was previously doing this in Coldfusion using a loop - until the client tossed 30,000 or so records at it at once. So, I’m trying to let the database do the heavy lifting now. Makes my head hurt.
SELECT nl_catREL.groups_id
, duplicates.ml_id
FROM nl_catREL
INNER
JOIN nl_mailgroups
ON nl_mailgroups.ml_id = nl_catREL.ml_id
INNER
JOIN nl_mailgroups AS duplicates
ON duplicates.ml_email = nl_mailgroups.ml_email
AND duplicates.ml_id > nl_mailgroups.ml_id
WHERE nl_catREL.groups_id = 1100
Guess I should have said “approximately” 150 records. I just created a new group with EXACTLY 150 reocrds.
To double check. the below returns 150 records:
SELECT nl_catREL.groups_id,nl_mailgroups.ml_id,nl_mailgroups.ml_email
FROM nl_catREL INNER JOIN
nl_mailgroups ONnl_catREL.ml_id =nl_mailgroups.ml_id
WHERE (nl_catREL.groups_id = 1104)
Your last query, adjusted for the new group id, returns exactly what it should now, which is making me wonder why it didn’t work correctly before.
SELECT DISTINCT nl_catREL.groups_id, duplicates.ml_id
FROM nl_catREL
INNER JOIN
nl_mailgroups ON nl_mailgroups.ml_id = nl_catREL.ml_id
INNER JOIN
nl_mailgroups AS duplicates ON duplicates.ml_email = nl_mailgroups.ml_email
AND duplicates.ml_id > nl_mailgroups.ml_id
WHERE (nl_catREL.groups_id = 1104)
The duplicates are returned bow. The new group had one additional one, noted in the return as well.
Question - why not a sub query using “WHERE EXISTS”? Most of the examples I was finding went in that direction and that’s what I’ve been attempting for way too many hours with awful results.
I have some more testing to do with different groups and such, but I think you’ve got it!
This is part of a larger application that imports records from an Excel sheet written in Coldfusion. It was originally using an older POI utility (well over 2000 lines of code), and was choking when the client tried to import very large Excel sheet with 10,000 or more records. It’s lightening fast compared to what it was.
This finding duplicates part was one of the main bottlenecks.
Actually, that’s the database doing that. I’m pasting the code into SQL Manager, and when I run the query, that’s what it does to it.
The real test of this was one large list that I know is in there with a little over 15,000 records. Doing that one outside the database was just crazy - 6 or 7 minutes to run. This is taking under 10 seconds with the same list.