SitePoint Sponsor

User Tag List

Results 1 to 4 of 4
  1. #1
    SitePoint Member
    Join Date
    Jul 2001
    Location
    San Francisco Bay Area
    Posts
    18
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    Flagging Duplicate Records

    Ok - so I have done a couple of hours of research on Google and these forums and still have not found a complete answer to my problem.

    I need to identify the duplicates in a table and flag them, so I can separate them from the rest of the data. The problem is, the only way to determine if they are duplicates is by comparing two fields, first_name and last_name.

    A record (for the sake of this discussion) is a duplicate if the concatenated string of first_name + last_name is equal to other records with the same data in those two fields.

    Using DISTINCT does not work because it includes one of the duplicate records in the result set.

    Any ideas?

  2. #2
    SQL Consultant gold trophysilver trophybronze trophy
    r937's Avatar
    Join Date
    Jul 2002
    Location
    Toronto, Canada
    Posts
    39,347
    Mentioned
    63 Post(s)
    Tagged
    3 Thread(s)
    you must have a weird problem because "sql duplicate records" on google turns up a boatload of relevant results on the first page in only a second or two
    Code:
    select firstname,lastname,count(*)
      from yourtable
    group by firstname,lastname
    having count(*) > 1
    this query will identify the duplicated combinations of firstname and lastname, and how many times each combination occurs

    to list the entire rows that contain those duplicated combinations is trickier
    Code:
    select *
      from yourtable
     where firstname||lastname
        in ( select firstname||lastname
               from yourtable
             group by firstname,lastname
             having count(*) > 1 )
    now, as for eliminating all but one of the rows with duplicated data, that's even more trickier, in fact, a lot will depend on how you want to decide which one to keep and what's in the other fields
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL
    "giving out my real stuffs"

  3. #3
    SitePoint Member
    Join Date
    Jul 2001
    Location
    San Francisco Bay Area
    Posts
    18
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Thanks for your reply... I ended up using the following

    Code:
    UPDATE afl_data SET full_name=CONCAT(first_name,' ',last_name);
    to combine the two columns into one full_name field and then,

    Code:
    SELECT full_name, COUNT(full_name) FROM afl_data GROUP BY full_name HAVING COUNT(full_name)>1;
    to find the dups... where I am stuck is representing the results of this query in an UPDATE statement that would,

    Code:
    UPDATE afl_data SET dup_flag='d'
    as apparently you cannot use COUNT() in a WHERE clause and an UPDATE statement cannot use a HAVING clause... any clues?

    My goal is to simply flag the dups, not delete them or extract them from the table.

    ** I need to point out that I am working with a mySQL db where subselects do not work.
    Last edited by telark; Sep 13, 2002 at 07:45.

  4. #4
    SQL Consultant gold trophysilver trophybronze trophy
    r937's Avatar
    Join Date
    Jul 2002
    Location
    Toronto, Canada
    Posts
    39,347
    Mentioned
    63 Post(s)
    Tagged
    3 Thread(s)
    the way i would do it is

    create table tempafl
    (full_name varchar(50))

    insert into tempafl
    select full_name
    from afl_data
    group by full_name having count(*)>1

    delete afl_data.*
    from afl_data, tempafl
    where afl_data.full_name = tempafl.full_name

    insert into afl_data (full_name)
    select full_name from tempafl

    drop tempafl

    note the delete statement requires mysql 4.0.2, according to the docs
    Last edited by r937; Sep 13, 2002 at 14:53.


Bookmarks

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •