SitePoint Sponsor

User Tag List

Results 1 to 12 of 12
  1. #1
    SitePoint Enthusiast
    Join Date
    Apr 2005
    Location
    Normal, Illinois
    Posts
    57
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    identify extremely similar rows

    I have a table in which I'd like a column to have no duplicate data. I'm familiar with the old standard of using HAVING and GROUP BY to identify exact matches, and the actual hard UNIQUE constraint for future data. This is proving to not be good enough though.

    What I'd like is to produce a query/algorithm to identify extremely similar data (ie. if only 2-3 characters are off in a 200 character string, I need to kill one of those rows), so that I can scrub up my data a bit better. It's making my brain spin. Any hints?
    hi!

  2. #2
    Utopia, Inc. silver trophy
    ScallioXTX's Avatar
    Join Date
    Aug 2008
    Location
    The Netherlands
    Posts
    9,070
    Mentioned
    153 Post(s)
    Tagged
    2 Thread(s)
    Levenshtein Distance springs to mind.

    Here's an article about it specific for MySQL: http://codejanitor.com/wp/2007/02/10...ored-function/
    Rémon - Hosting Advisor

    SitePoint forums will switch to Discourse soon! Make sure you're ready for it!

    Minimal Bookmarks Tree
    My Google Chrome extension: browsing bookmarks made easy

  3. #3
    SitePoint Enthusiast
    Join Date
    Apr 2005
    Location
    Normal, Illinois
    Posts
    57
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Wow, this looks perfect. Going to try implementing it tonight. Thank you!
    hi!

  4. #4
    SitePoint Enthusiast
    Join Date
    Apr 2005
    Location
    Normal, Illinois
    Posts
    57
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Levenshtein functions installed and seem to be doing what they should, looks like this will work, thanks again. To anyone that ends up following after, phpMyAdmin doesn't seem to play nice with stored functions. Here's exactly what I did on the command line to make this work right (max string size is adjusted to 400 characters).


    Code:
    mysql> delimiter //
    mysql> CREATE FUNCTION LEVENSHTEIN (s1 VARCHAR(400), s2 VARCHAR(400))                       
        ->   RETURNS INT
        ->     DETERMINISTIC
        ->       BEGIN
        ->         DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
        ->         DECLARE s1_char CHAR;
        ->         DECLARE cv0, cv1 VARBINARY(256);
        ->         SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
        ->         IF s1 = s2 THEN
        ->           RETURN 0;
        ->         ELSEIF s1_len = 0 THEN
        ->           RETURN s2_len;
        ->         ELSEIF s2_len = 0 THEN
        ->           RETURN s1_len;
        ->         ELSE
        ->           WHILE j <= s2_len DO
        ->             SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
        ->           END WHILE;
        ->           WHILE i <= s1_len DO
        ->             SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
        ->             WHILE j <= s2_len DO
        ->                 SET c = c + 1;
        ->                 IF s1_char = SUBSTRING(s2, j, 1) THEN SET cost = 0; ELSE SET cost = 1; END IF;
        ->                 SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
        ->                 IF c > c_temp THEN SET c = c_temp; END IF;
        ->                 SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
        ->                 IF c > c_temp THEN SET c = c_temp; END IF;
        ->                 SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
        ->             END WHILE;
        ->             SET cv1 = cv0, i = i + 1;
        ->           END WHILE;
        ->         END IF;
        ->         RETURN c;
        ->       END//
    Query OK, 0 rows affected (0.02 sec)
    
    mysql> CREATE FUNCTION LEVENSHTEIN_RATIO (s1 VARCHAR(400), s2 VARCHAR(400))
        ->   RETURNS INT
        ->     DETERMINISTIC
        ->       BEGIN
        ->         DECLARE s1_len, s2_len, max_len INT;
        ->         SET s1_len = LENGTH(s1), s2_len = LENGTH(s2);
        ->         IF s1_len > s2_len THEN SET max_len = s1_len; ELSE SET max_len = s2_len; END IF;
        ->         RETURN ROUND((1 - LEVENSHTEIN(s1, s2) / max_len) * 100);
        ->       END//
    Query OK, 0 rows affected (0.00 sec)
    
    mysql> delimiter ;
    hi!

  5. #5
    SitePoint Enthusiast
    Join Date
    Apr 2005
    Location
    Normal, Illinois
    Posts
    57
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Alright, one more mental block. Let's say I have 100k quotes of text. Here's how the Levenshtein Function along with the ratio function works:

    Code:
    SELECT LEVENSHTEIN_RATIO((SELECT text FROM articles WHERE a_id = 50678),(SELECT text FROM articles WHERE a_id = 35039)) AS ratio
    This returns a percentage of how similar the two are, and is working perfectly. From the looks of it, I'll probably look to combine those that have a ratio of 80 or higher once I've gathered the data.

    Now, is there a way to realistically check through 100k rows using purely SQL and the levenshtein functions? I'd be eternally grateful to anyone smart enough to dodge the slow polling of a very PHP-heavy / multiple queried cron that I'm burdened with otherwise.
    hi!

  6. #6
    Utopia, Inc. silver trophy
    ScallioXTX's Avatar
    Join Date
    Aug 2008
    Location
    The Netherlands
    Posts
    9,070
    Mentioned
    153 Post(s)
    Tagged
    2 Thread(s)
    What you could try is order all string by length and only compare each string to each peers; say, strings that between 5 shorter and 5 longer than the one you're comparing to.
    The likeliness of two strings looking similar will get smaller as the difference between the lengths of the strings increases.
    Not sure if it helps since it's highly dependent on the data at hand, but it's worth a try methinks.
    Rémon - Hosting Advisor

    SitePoint forums will switch to Discourse soon! Make sure you're ready for it!

    Minimal Bookmarks Tree
    My Google Chrome extension: browsing bookmarks made easy

  7. #7
    SitePoint Enthusiast
    Join Date
    Apr 2005
    Location
    Normal, Illinois
    Posts
    57
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    That helps drastically actually, though I'm still looking at 486 days of polling to get through all the data. Algorithm currently goes:

    SELECT current article id, artice (1 query)
    SELECT all other articles having text length within a few characters (1 query)
    COMPARE LEVENSHTEIN_RATIO, store duplicates if found in the DB (10k-40k queries, down from 100k+ before reducing results using string length)
    STORE next article ID in the DB to try in another 5 minutes, starting process over (1 query)

    I'm at about 20 seconds in running this query when it compares to 10k other rows, potentially in excess of a minute on others. If I could make this efficient enough to run once every minute, the polling would be done in more like a month, which I may be patient enough for.
    hi!

  8. #8
    Utopia, Inc. silver trophy
    ScallioXTX's Avatar
    Join Date
    Aug 2008
    Location
    The Netherlands
    Posts
    9,070
    Mentioned
    153 Post(s)
    Tagged
    2 Thread(s)
    And if you perform all the levenshtein stuff in your programming language instead of querying MySQL for it each time?
    That should ~theoretically~ be faster
    Rémon - Hosting Advisor

    SitePoint forums will switch to Discourse soon! Make sure you're ready for it!

    Minimal Bookmarks Tree
    My Google Chrome extension: browsing bookmarks made easy

  9. #9
    SitePoint Enthusiast
    Join Date
    Apr 2005
    Location
    Normal, Illinois
    Posts
    57
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by ScallioXTX View Post
    And if you perform all the levenshtein stuff in your programming language instead of querying MySQL for it each time?
    That should ~theoretically~ be faster
    Incredibly faster, I didn't even think to try PHP's levenshtein function.

    Quote Originally Posted by microtime()+mysql_num_rows()
    This page took 0.520920 seconds to compare 103060 records.
    That can work!
    hi!

  10. #10
    Utopia, Inc. silver trophy
    ScallioXTX's Avatar
    Join Date
    Aug 2008
    Location
    The Netherlands
    Posts
    9,070
    Mentioned
    153 Post(s)
    Tagged
    2 Thread(s)
    Let me know the final results please
    Rémon - Hosting Advisor

    SitePoint forums will switch to Discourse soon! Make sure you're ready for it!

    Minimal Bookmarks Tree
    My Google Chrome extension: browsing bookmarks made easy

  11. #11
    SitePoint Enthusiast
    Join Date
    Apr 2005
    Location
    Normal, Illinois
    Posts
    57
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Sure, final verdict: PHP levenshtein comparisons are a great solution for this. There are lots of modified versions of levenshtein() on php.net that all seem to work better though. Here's the one I ended up going with:

    PHP Code:
    function levenshteinDistance2($str1$str2)
    {
        
    $len1 mb_strlen($str1);
        
    $len2 mb_strlen($str2);
        
        
    // strip common prefix
        
    $i 0;
        do {
            if(
    mb_substr($str1$i1) != mb_substr($str2$i1))
                break;
            
    $i++;
            
    $len1--;
            
    $len2--;
        } while(
    $len1 && $len2 0);
        if(
    $i 0) {
            
    $str1 mb_substr($str1$i);
            
    $str2 mb_substr($str2$i);
        }
        
        
    // strip common suffix
        
    $i 0;
        do {
            if(
    mb_substr($str1$len1-11) != mb_substr($str2$len2-11))
                break;
            
    $i++;
            
    $len1--;
            
    $len2--;
        } while(
    $len1 && $len2 0);
        if(
    $i 0) {
            
    $str1 mb_substr($str10$len1);
            
    $str2 mb_substr($str20$len2);
        }
        
        if (
    $len1 == 0)
            return 
    $len2;    
        if (
    $len2 == 0)
            return 
    $len1;
        
        
    $v0 range(0$len1);
        
    $v1 = array();
        
        for (
    $i 1$i <= $len2$i++) {
            
    $v1[0] = $i;
            
    $str2j mb_substr($str2$i 11);
            
            for (
    $j 1$j <= $len1$j++) {
                
    $cost = (mb_substr($str1$j 11) == $str2j) ? 1;
                
                
    $m_min $v0[$j] + 1;
                
    $b $v1[$j 1] + 1;
                
    $c $v0[$j 1] + $cost;
                
                if (
    $b $m_min)
                    
    $m_min $b;
                if (
    $c $m_min)
                    
    $m_min $c;
                
                
    $v1[$j] = $m_min;
            }
            
            
    $vTmp $v0;
            
    $v0 $v1;
            
    $v1 $vTmp;
        }
        
        return 
    $v0[$len1];

    I've come up with a solution using this function that's polling millions of rows of fulltext against millions of other rows, now due to finish in right around 40 days of a one-minute cron, and running great at this moment. Thanks again!
    hi!

  12. #12
    Utopia, Inc. silver trophy
    ScallioXTX's Avatar
    Join Date
    Aug 2008
    Location
    The Netherlands
    Posts
    9,070
    Mentioned
    153 Post(s)
    Tagged
    2 Thread(s)
    Sounds good. Thanks for coming back to this; I appreciate it
    Rémon - Hosting Advisor

    SitePoint forums will switch to Discourse soon! Make sure you're ready for it!

    Minimal Bookmarks Tree
    My Google Chrome extension: browsing bookmarks made easy


Bookmarks

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •