SitePoint Sponsor

User Tag List

Results 1 to 3 of 3
  1. #1
    SitePoint Member
    Join Date
    May 2008
    Posts
    2
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    Looking for duplicates on a very large table

    I have a table with around 21 million rows

    The schema looks like this:
    id int (Auto increment value)
    firstame varchar(20)
    lastame varchar(25)
    middlename varchar(15)
    dob varchar(10) (Dates are strings like 01/01/1999)
    state varchar(2) (OR, WA & CA)


    I want to find all of the duplicates that exist based on Firstnam, Lastname and DOB based on the decade of birth. The code I wrote for this is below:

    Code:
    SELECT c.lastname, c.firstname, c.dob,  c.Id, c.middlename, c.state
    FROM combined c
    WHERE substring(c.dob,-4) between 1900 and 1909
    AND (c.lastname in
    (select lastname from combined
    where substring(dob,-4) between 1900 and 1909
    group by lastname,firstname,dob HAVING count(*)>1
    and firstname = c.firstname
    and dob = c.dob))
    Is this the best method? Should I be using a join? It seems to take forever even with only 65,000 entries from 1900 - 1909...

    Any help getting this thing working faster would be much appreciated.

  2. #2
    SQL Consultant gold trophysilver trophybronze trophy
    r937's Avatar
    Join Date
    Jul 2002
    Location
    Toronto, Canada
    Posts
    39,350
    Mentioned
    63 Post(s)
    Tagged
    3 Thread(s)
    you will forever be slowed down because of the substring

    why not do the whole table at once?

    if you had an index on (dob,lastname,firstname) the query should actually fly, since most restrictive is dob
    Code:
    SELECT c.lastname, c.firstname, c.dob
         ,  c.Id, c.middlename, c.state
      FROM combined c
    INNER
      JOIN ( SELECT lastname, firstname, dob
               FROM combined
              WHERE substring(c.dob,-4) between 1900 and 1909
             GROUP
                 BY lastname, firstname, dob
             HAVING COUNT(*) > 1 ) dupes
        ON dupes.lastname = c.lastname
       AND dupes.firstname = c.firstname
       AND dupes.dob = c.dob
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL
    "giving out my real stuffs"

  3. #3
    SitePoint Member
    Join Date
    May 2008
    Posts
    2
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Thanks for the direction pointing.... Its working relatively fast using that JOIN method. I have never seen a join to a "sub select" like that.


Bookmarks

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •