SitePoint Sponsor

User Tag List

Results 1 to 9 of 9
  1. #1
    SitePoint Enthusiast
    Join Date
    Dec 2005
    Posts
    42
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    Detecting duplicate/similar rows in mysql

    Hi all,
    so we are scraping multiple sites to get information about events. And sometimes multiple sites contain the same info. Obviously, the system takes it thinking its not duplicate because its not really an exact duplicate. Example below.

    row 1: Roger Waters, Air Canada Center, Toronto, ON
    row 2: An Evening with Roger Waters, Air Canada, Toronto, ON


    Is there a way in MySQL to go through all the rows and find similar rows so to speak, Can I use full text indexing somehow to figure out a percentage match on the rows and create a threshold( if % match on 2 rows is greater than 90, its considered a duplicate)?

    Any help on this is greatly appreciated.

    sincerely,

  2. #2
    SQL Consultant gold trophysilver trophybronze trophy
    r937's Avatar
    Join Date
    Jul 2002
    Location
    Toronto, Canada
    Posts
    39,350
    Mentioned
    63 Post(s)
    Tagged
    3 Thread(s)
    there is no way you're going to get satisfactory results with SQL

    perhaps some artificial intelligence software outside the database?

    if it were up to me, i'd hire some human editors

    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL
    "giving out my real stuffs"

  3. #3
    SitePoint Enthusiast
    Join Date
    Dec 2005
    Posts
    42
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by r937 View Post
    there is no way you're going to get satisfactory results with SQL

    perhaps some artificial intelligence software outside the database?

    if it were up to me, i'd hire some human editors


    Thats exactly what we are thinking as well . But the way we want to incorporate human intervention is first "suggest" possible duplications to the admin, so that they can approve/reject it. Once thats done the system will keep a dictionary to remember the response so that it can use it for future duplications. So in time, there will be less and less suggestions and the system should be able to pick up duplicates by itself. Dont know if that makes sense
    So do you have any thoughts on getting some results ( which may not be satisfactory ) to work with?
    thanks in advance for your time.

    sincerely,

  4. #4
    SitePoint Member Mez's Avatar
    Join Date
    Mar 2008
    Posts
    6
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    FULLTEXT doesn't allow conditionals so you can rule that out.

    Store each word for an event in a separate table with an associated event_id. When subsequent events are inserted check against the word table and based on the number/percentage of matches for a single event_id flag it for review.

    Words for row 1 (event_id: 1)
    Roger
    Waters
    Air
    Canada
    Center
    Toronto
    ON

    ...would result in 86% of words found in event_id 1 exist in event_id 2, 67% of words in event_id 2 match an existing event.

  5. #5
    SitePoint Enthusiast
    Join Date
    Dec 2005
    Posts
    42
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by Mez View Post
    FULLTEXT doesn't allow conditionals so you can rule that out.

    Store each word for an event in a separate table with an associated event_id. When subsequent events are inserted check against the word table and based on the number/percentage of matches for a single event_id flag it for review.

    Words for row 1 (event_id: 1)
    Roger
    Waters
    Air
    Canada
    Center
    Toronto
    ON

    ...would result in 86% of words found in event_id 1 exist in event_id 2, 67% of words in event_id 2 match an existing event.


    Hi Mez

    Thanks for the reply, unfortunately, I need to do this completely on mysql, is there a way I can sort of go through each of the events and break up the string that holds the description? and then formulate a result through sql.

    is there a way to find out an amount of similarity between two strings using mysql?


    thanks

  6. #6
    SQL Consultant gold trophysilver trophybronze trophy
    r937's Avatar
    Join Date
    Jul 2002
    Location
    Toronto, Canada
    Posts
    39,350
    Mentioned
    63 Post(s)
    Tagged
    3 Thread(s)
    Quote Originally Posted by abdullahc View Post
    I need to do this completely on mysql
    may i ask why?
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL
    "giving out my real stuffs"

  7. #7
    SitePoint Enthusiast
    Join Date
    Dec 2005
    Posts
    42
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by r937 View Post
    may i ask why?
    Hi, sure!
    This site is scraping articles, so sometimes there are duplicate articles ( same content) but a slightlly different title.

    So we want to take the two titles, compare it , and if the titles match more than 90% , datetime is close within a range, we dont want to show it on the site.

    any help is appreciated, this is a serious roadblock from our end
    thanks!

  8. #8
    SQL Consultant gold trophysilver trophybronze trophy
    r937's Avatar
    Join Date
    Jul 2002
    Location
    Toronto, Canada
    Posts
    39,350
    Mentioned
    63 Post(s)
    Tagged
    3 Thread(s)
    ah, scraping, i see

    well, good luck with that then
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL
    "giving out my real stuffs"

  9. #9
    SitePoint Enthusiast
    Join Date
    Dec 2005
    Posts
    42
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Hi guys,

    so we got around this problem by using a php cron job.

    we do have a trigger and a store procedure in the mysql db to flag events that are similar by comparing the date time and using LIKE %

    after that we run a php cron job to go through all the records that have that flag, compare it to each other using similar_text()

    this worked, still slow, but on the server side ofcourse.

    thanks for all your help


    Quote Originally Posted by abdullahc View Post
    Thats exactly what we are thinking as well . But the way we want to incorporate human intervention is first "suggest" possible duplications to the admin, so that they can approve/reject it. Once thats done the system will keep a dictionary to remember the response so that it can use it for future duplications. So in time, there will be less and less suggestions and the system should be able to pick up duplicates by itself. Dont know if that makes sense
    So do you have any thoughts on getting some results ( which may not be satisfactory ) to work with?
    thanks in advance for your time.

    sincerely,


Bookmarks

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •