SitePoint Sponsor

User Tag List

Results 1 to 14 of 14
  1. #1
    SitePoint Member
    Join Date
    May 2004
    Location
    Montreal, QC, Canada
    Posts
    5
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    SELECT DISTINCT - return entire row

    I read many of the posts here, but found no answer to the following problem: I have a database with entries like:

    DROP TABLE IF EXISTS junktest;
    CREATE TABLE junktest (
    id int(11) NOT NULL auto_increment,
    name char(200) NOT NULL default '',
    place char(200) NOT NULL default '',
    email char(100) NOT NULL default '',
    ip char(50) NOT NULL default '',
    PRIMARY KEY (id)
    ) TYPE=MyISAM;
    INSERT INTO junktest VALUES
    ('', 'John Doe', 'London, UK', 'jon@hotmail.com', '100.100.100.100'),
    ('', 'Rachel Doe', 'London, UK', 'jon@hotmail.com', '100.100.100.100'),
    ('', 'Jane T.', 'Glasgow, UK', 'jane@server.net', '123.456.789.01'),
    ('', 'John Doe', 'London - UK', 'jon@hotmail.com', '100.100.100.100'),
    ('', 'Peter Noth', 'Toronto, Canada', 'pete@hotmail.com', '101.101.101.101'),
    ('', 'John Doe', 'Washington, US', 'jonnie@whitehouse.org', '103.104.105.106');


    I want to eliminate duplicate entries like the first and the fourth, since they are essentially the same. However, several people can share the same email address, and several people can have the same name. Therefore I want to get only the distinct entries for name, email, and ip. The problem: when I do this

    SELECT DISTINCT name, email, ip FROM junktest

    ...I get

    John Doe jon@hotmail.com 100.100.100.100
    Rachel Doe jon@hotmail.com 100.100.100.100
    Jane T. jane@server.net 123.456.789.01
    Peter Noth pete@hotmail.com 101.101.101.101
    John Doe jonnie@whitehouse.org 103.104.105.106


    but I need the place column as well. Any suggestions? Thanks in advance,

    - thor .
    Last edited by thorvandahl; May 29, 2004 at 13:17.

  2. #2
    SQL Consultant gold trophysilver trophybronze trophy
    r937's Avatar
    Join Date
    Jul 2002
    Location
    Toronto, Canada
    Posts
    39,347
    Mentioned
    63 Post(s)
    Tagged
    3 Thread(s)
    you need the place column as well as what else?

    what i think you're saying is you want to identify "duplicates" based on only 3 columns, name, email, ip

    suppose there are two rows with the same values in all three columns

    you want both rows, with all columns?
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL
    "giving out my real stuffs"

  3. #3
    SitePoint Member
    Join Date
    May 2004
    Location
    Montreal, QC, Canada
    Posts
    5
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    well, the distinct criteria applies to the three columns namely
    (name, email, ip). the reason is, i admit, a badly designed form which at times entered "London, UK" and others "London - UK" in the "place" column. so something like this
    ('', 'John Doe', 'London, UK', 'jon@hotmail.com', '100.100.100.100'),
    ('', 'John Doe', 'London - UK', 'jon@hotmail.com', '100.100.100.100');

    would be identical. now, when I do
    SELECT DISTINCT name, email, ip FROM junktest
    i get
    John Doe jon@hotmail.com 100.100.100.100
    excellent! but there's no "place" column in the returned query... i would like to get
    John Doe London UK jon@hotmail.com 100.100.100.100
    ...however when i do
    SELECT DISTINCT name, place, email, ip FROM junktest
    i will get
    John Doe London, UK jon@hotmail.com 100.100.100.100
    John Doe London - UK jon@hotmail.com 100.100.100.100

    but i am interested in only one such entry... is there perhaps a wildcard match which could be used on the "place" column?

    thanks again,

    - thor .
    Last edited by thorvandahl; May 29, 2004 at 13:15.

  4. #4
    SQL Consultant gold trophysilver trophybronze trophy
    r937's Avatar
    Join Date
    Jul 2002
    Location
    Toronto, Canada
    Posts
    39,347
    Mentioned
    63 Post(s)
    Tagged
    3 Thread(s)
    "but i am interested in only one such entry"

    there's your challenge -- you need to have a rule for deciding which one
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL
    "giving out my real stuffs"

  5. #5
    Chessplayer kleineme's Avatar
    Join Date
    Apr 2004
    Location
    Germany
    Posts
    608
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Hi,

    if it's irrelevant which row you get you may try something like this:

    Code:
    SELECT * FROM junktest GROUP BY name, email, ip
    Never ascribe to malice,
    that which can be explained by incompetence.
    Your code should not look unmaintainable, just be that way.

  6. #6
    SQL Consultant gold trophysilver trophybronze trophy
    r937's Avatar
    Join Date
    Jul 2002
    Location
    Toronto, Canada
    Posts
    39,347
    Mentioned
    63 Post(s)
    Tagged
    3 Thread(s)
    if you do that, you could get id and place names from different rows!
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL
    "giving out my real stuffs"

  7. #7
    SitePoint Member
    Join Date
    May 2004
    Location
    Montreal, QC, Canada
    Posts
    5
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by r937
    if you do that, you could get id and place names from different rows!
    ...I tried
    SELECT * FROM junktest GROUP BY name, email, ip ORDER BY id
    and it seems to do the trick... so when you say i could get id and place names from different rows do you mean there will be a mis-alignment of column to row data?

  8. #8
    Chessplayer kleineme's Avatar
    Join Date
    Apr 2004
    Location
    Germany
    Posts
    608
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Hi,

    obviously you are right according to the manual. But what about the suggestion about using MAX or MIN? Wouldn't that do the trick in this special case?

    Code:
    SELECT name, email, ip, MAX(place) FROM junktest GROUP BY name, email, ip
    Never ascribe to malice,
    that which can be explained by incompetence.
    Your code should not look unmaintainable, just be that way.

  9. #9
    SQL Consultant gold trophysilver trophybronze trophy
    r937's Avatar
    Join Date
    Jul 2002
    Location
    Toronto, Canada
    Posts
    39,347
    Mentioned
    63 Post(s)
    Tagged
    3 Thread(s)
    now we're heading in the right direction, kleineme

    how would you also add id to the result row?
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL
    "giving out my real stuffs"

  10. #10
    Chessplayer kleineme's Avatar
    Join Date
    Apr 2004
    Location
    Germany
    Posts
    608
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Hi,

    is this some kind of quiz? What are the prizes?

    You want me to say "select max(id), max(place), name, email, ip FROM junktest GROUP BY name, email, ip", but I won't fall for that, because this could give me the wrong ID for a given place

    So, maybe we can try to concatenate the two columns:

    Code:
    SELECT MAX(CONCAT(id, place)), name, email, ip FROM junktest GROUP BY name, email, ip
    Never ascribe to malice,
    that which can be explained by incompetence.
    Your code should not look unmaintainable, just be that way.

  11. #11
    SitePoint Member
    Join Date
    May 2004
    Location
    Montreal, QC, Canada
    Posts
    5
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    my main questions have been answered (thanks everyone!!) i just need some clarification on one thing: when you say its possible to "get id and place names from different rows" as related to the latest posts, do you mean that there's a possibility of getting a completely wrong id associated with a given place/row, as in a misalignment of some kind, or just get an arbitrary id for one of the records which meet the criteria...?...??... as in, to use the above examples, if i had three records like
    1 John Doe London, UK jon@hotmail.com 100.100.100.100
    2345 John Doe London - UK jon@hotmail.com 100.100.100.100
    557 John Doe London ; UK jon@hotmail.com 100.100.100.100
    ...i would arbitrarily get 1 or 2345 or 557 ?...

    - thor .

  12. #12
    SQL Consultant gold trophysilver trophybronze trophy
    r937's Avatar
    Join Date
    Jul 2002
    Location
    Toronto, Canada
    Posts
    39,347
    Mentioned
    63 Post(s)
    Tagged
    3 Thread(s)
    misaligned, yes, that would be one way to describe it

    remember, the mysql folks themselves say that the results are unpredictable

    and of course totally illegal in every other database, so it will serve you well to find a way around the problem now
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL
    "giving out my real stuffs"

  13. #13
    SitePoint Member
    Join Date
    May 2004
    Location
    Montreal, QC, Canada
    Posts
    5
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    my solution

    Quote Originally Posted by r937
    remember, the mysql folks themselves say that the results are unpredictable
    ah yes, i saw the link you posted to http://dev.mysql.com/doc/mysql/en/GR...den-fields.htm

    here's my brute-force solution, hope it helps someone some day... it is useful to note that i don't care about the exact id number as long as my entries follow their original order (as in a heap). here i use a temporary table to sort things the way i need them, and then put them back into the original where they will get auto re-id-ed..

    DROP TABLE IF EXISTS junktest_temp;
    CREATE TABLE junktest_temp(id int(11), name char(200), place char(200), email char(100), ip char(50))
    TYPE=HEAP;
    INSERT INTO junktest_temp(name, place, email, ip) SELECT name, MAX(place), email, ip FROM junktest GROUP BY name, email, ip ORDER BY id;
    DELETE FROM junktest;
    INSERT INTO junktest(id, name, place, email, ip) SELECT id, name, place, email, ip FROM junktest_temp;
    Last edited by thorvandahl; Jun 1, 2004 at 18:31.

  14. #14
    SitePoint Wizard
    Join Date
    Jul 2003
    Location
    Kent
    Posts
    1,921
    Mentioned
    1 Post(s)
    Tagged
    0 Thread(s)
    why not try a count() on you name, email, id, having count()>1 that will bring up those with duplicates for you. Or am I misunderstanding the question?


Bookmarks

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •