I am creating a database of quiz questions. There is an issue that duplicates may appear, since a question may be phrased in more than one way. For example:
Q: Which country was previously known as Rhodesia?
A: Zimbabwe
is a duplicate of
Q: Which African country was once called Rhodesia?
A: Zimbabwe
However, these two questions are obviously not duplicates:
Q: Which country was previously known as Rhodesia?
A: Zimbabwe
Q: Which country, after became a test cricket nation in 1992, had its test cricket status suspended in 2006?
A: Zimbabwe
So what I need to do is select all duplicate answers plus their questions, so that I can determine if any are actual duplicates.
Most online tutorials suggest something like this:
SELECT question, answer FROM questions GROUP BY answer HAVING COUNT(answer) > 1;
But that just gives me the first instance of the duplicate, whereas I need to display BOTH instances.
I’m just looking for dupes in the answer field tho.
If I can display both rows, then I can discern if the question is a duplicate. If I can only see the first row in the group, then I can’t.
Using the query in my previous post, I would get the following result:
Q: Which country was previously known as Rhodesia?
A: Zimbabwe
even though, there are 3 rows that contain Zimbabwe in the answer field. Obviously I need to delete one of the entries regarding Rhodesia, but the one about test cricket can stay.
So I want to return all entries that contain two or more identical answer fields. I’m just not sure what that query would be.
darn, i’m sorry, that’s so obvious, i don’t know why i didn’t understand it right off…
SELECT t.question
, t.answer
FROM ( SELECT answer
FROM questions
GROUP
BY answer
HAVING COUNT(*) > 1 ) AS d
INNER
JOIN questions AS t
ON t.answer = d.answer
ORDER
BY t.answer
the subquery, which has a table alias of d, should look familiar – it’s the same as the query you were given elsewhere, and it simply retrieves all the answers that are dupes
this subquery, when used in the FROM clause like that, is called a derived table, because it’s actually a table – all queries produce tables – and as a table, you can join it to other tables
so we join it back to the original table, using the answer column as the join column
now, the answer values in the derived table are unique, but each of them will join to two or more rows in the original table, right?
Ok, I see, so d stands for derived table, and t stands for original table.
So you join the derived table (the table of grouped rows) back onto the original table with an inner join on the answers field. Because its an inner join (rather than a left join for example), only the joined results show, which are the dupes.