Identify extremely similar rows

I have a table in which I’d like a column to have no duplicate data. I’m familiar with the old standard of using GROUP BY and HAVING to identify exact matches, plus a hard UNIQUE constraint to keep future data clean. That’s proving not to be good enough, though.
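
For reference, the exact-match check I mean is just the usual grouping trick, something like this (using my articles table, where the column is called text):

-- exact duplicates only: identical text values grouped together
SELECT text, COUNT(*) AS copies
FROM articles
GROUP BY text
HAVING COUNT(*) > 1;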

What I’d like is to produce a query/algorithm to identify extremely similar data (i.e. if only 2-3 characters differ in a 200-character string, I need to kill one of those rows), so that I can scrub up my data a bit better. It’s making my brain spin. Any hints?

Levenshtein distance springs to mind: the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other.

Here’s an article about implementing it as a MySQL stored function: http://codejanitor.com/wp/2007/02/10/levenshtein-distance-as-a-mysql-stored-function/

Wow, this looks perfect. Going to try implementing it tonight. Thank you!

Levenshtein functions installed, and they seem to be doing what they should; looks like this will work, thanks again. To anyone that ends up following after: phpMyAdmin doesn’t seem to play nice with stored functions. Here’s exactly what I did on the command line to make this work right (max string size is adjusted to 400 characters, though fair warning: the function packs one byte per cell into its work vectors, so distances and strings beyond about 255 characters won’t encode reliably). :slight_smile:

mysql> delimiter //
mysql> CREATE FUNCTION LEVENSHTEIN (s1 VARCHAR(400), s2 VARCHAR(400))                       
    ->   RETURNS INT
    ->     DETERMINISTIC
    ->       BEGIN
    ->         DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
    ->         DECLARE s1_char CHAR;
    ->         DECLARE cv0, cv1 VARBINARY(256);
    ->         SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
    ->         IF s1 = s2 THEN
    ->           RETURN 0;
    ->         ELSEIF s1_len = 0 THEN
    ->           RETURN s2_len;
    ->         ELSEIF s2_len = 0 THEN
    ->           RETURN s1_len;
    ->         ELSE
    ->           WHILE j <= s2_len DO
    ->             SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
    ->           END WHILE;
    ->           WHILE i <= s1_len DO
    ->             SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
    ->             WHILE j <= s2_len DO
    ->                 SET c = c + 1;
    ->                 IF s1_char = SUBSTRING(s2, j, 1) THEN SET cost = 0; ELSE SET cost = 1; END IF;
    ->                 SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
    ->                 IF c > c_temp THEN SET c = c_temp; END IF;
    ->                 SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
    ->                 IF c > c_temp THEN SET c = c_temp; END IF;
    ->                 SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
    ->             END WHILE;
    ->             SET cv1 = cv0, i = i + 1;
    ->           END WHILE;
    ->         END IF;
    ->         RETURN c;
    ->       END//
Query OK, 0 rows affected (0.02 sec)

mysql> CREATE FUNCTION LEVENSHTEIN_RATIO (s1 VARCHAR(400), s2 VARCHAR(400))
    ->   RETURNS INT
    ->     DETERMINISTIC
    ->       BEGIN
    ->         DECLARE s1_len, s2_len, max_len INT;
    ->         SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2);
    ->         IF s1_len > s2_len THEN SET max_len = s1_len; ELSE SET max_len = s2_len; END IF;
    ->         RETURN ROUND((1 - LEVENSHTEIN(s1, s2) / max_len) * 100);
    ->       END//
Query OK, 0 rows affected (0.00 sec)

mysql> delimiter ;
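
A quick sanity check that both functions behave (the classic kitten/sitting pair: 3 edits, and the ratio should come out to ROUND((1 - 3/7) * 100) = 57):

SELECT LEVENSHTEIN('kitten', 'sitting') AS distance,
       LEVENSHTEIN_RATIO('kitten', 'sitting') AS ratio;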

Alright, one more mental block. Let’s say I have 100k quotes of text. Here’s how the Levenshtein function and the ratio function work together:

SELECT LEVENSHTEIN_RATIO((SELECT text FROM articles WHERE a_id = 50678),(SELECT text FROM articles WHERE a_id = 35039)) AS ratio

This returns a percentage of how similar the two are, and is working perfectly. From the looks of it, I’ll probably look to combine those that have a ratio of 80 or higher once I’ve gathered the data.
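
In principle the whole 80-or-higher filter could be written as a self-join, something like the sketch below (the a1.a_id < a2.a_id condition just keeps each pair from being checked twice), but that compares every row against every other row:

SELECT a1.a_id, a2.a_id,
       LEVENSHTEIN_RATIO(a1.text, a2.text) AS ratio
FROM articles a1
JOIN articles a2 ON a1.a_id < a2.a_id
WHERE LEVENSHTEIN_RATIO(a1.text, a2.text) >= 80;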

Now, is there a way to realistically check through 100k rows using purely SQL and the Levenshtein functions? I’d be eternally grateful to anyone smart enough to help me dodge the slow polling of a very PHP-heavy, multi-query cron that I’m burdened with otherwise. :slight_smile:

What you could try is ordering all the strings by length and only comparing each string to its near peers; say, strings between 5 characters shorter and 5 characters longer than the one you’re comparing.
The likelihood of two strings being similar shrinks as the difference between their lengths grows.
Not sure if it helps, since it’s highly dependent on the data at hand, but it’s worth a try methinks.
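
Something like this, say (untested; @cur_id and @cur_len are just placeholder user variables for whichever article you’re currently processing):

SELECT CHAR_LENGTH(text) INTO @cur_len FROM articles WHERE a_id = @cur_id;

SELECT a_id, text
FROM articles
WHERE a_id <> @cur_id
  AND CHAR_LENGTH(text) BETWEEN @cur_len - 5 AND @cur_len + 5;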

That helps drastically, actually, though I’m still looking at 486 days of polling to get through all the data. The algorithm currently goes:

SELECT the current article id and article text (1 query)
SELECT all other articles with a text length within a few characters of it (1 query; see the indexing note below)
COMPARE each with LEVENSHTEIN_RATIO, storing duplicates in the DB if found (10k-40k queries, down from 100k+ before filtering by string length)
STORE the next article ID in the DB to try in another 5 minutes, starting the process over (1 query)

I’m at about 20 seconds per run when it compares against 10k other rows, potentially in excess of a minute on others. If I could make this efficient enough to run once every minute, the polling would be done in more like a month, which I may be patient enough for.
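
Side note for anyone following after: the candidate SELECT in step 2 can’t use an index while CHAR_LENGTH(text) is computed per row, so I’m assuming a precomputed, indexed length column would at least make that part cheap (the LEVENSHTEIN_RATIO calls still dominate the runtime). Column and index names here are made up:

ALTER TABLE articles
  ADD COLUMN text_len INT,
  ADD INDEX idx_text_len (text_len);

UPDATE articles SET text_len = CHAR_LENGTH(text);

SELECT a_id, text
FROM articles
WHERE a_id <> @cur_id
  AND text_len BETWEEN @cur_len - 5 AND @cur_len + 5;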

And what if you performed all the Levenshtein stuff in your programming language instead of querying MySQL for it each time?
That should ~theoretically~ be faster

Incredibly faster; I didn’t even think to try PHP’s levenshtein() function.

That can work!

Let me know the final results please :slight_smile:

Sure, final verdict: PHP Levenshtein comparisons are a great solution for this. There are lots of modified versions of levenshtein() in the user comments on php.net that seem to work better than the built-in (which is capped at 255 characters and isn’t multibyte-aware). Here’s the one I ended up going with; it strips any shared prefix and suffix first, which is a huge win on near-duplicate quotes, and keeps only two rows of the edit-distance matrix in memory:

function levenshteinDistance2($str1, $str2)
{
    $len1 = mb_strlen($str1);
    $len2 = mb_strlen($str2);

    // identical strings would run the suffix-stripping loop below into
    // negative lengths, so handle that case up front
    if ($str1 === $str2)
        return 0;
    
    // strip common prefix
    $i = 0;
    do {
        if(mb_substr($str1, $i, 1) != mb_substr($str2, $i, 1))
            break;
        $i++;
        $len1--;
        $len2--;
    } while($len1 > 0 && $len2 > 0);
    if($i > 0) {
        $str1 = mb_substr($str1, $i);
        $str2 = mb_substr($str2, $i);
    }
    
    // strip common suffix
    $i = 0;
    do {
        if(mb_substr($str1, $len1-1, 1) != mb_substr($str2, $len2-1, 1))
            break;
        $i++;
        $len1--;
        $len2--;
    } while($len1 > 0 && $len2 > 0);
    if($i > 0) {
        $str1 = mb_substr($str1, 0, $len1);
        $str2 = mb_substr($str2, 0, $len2);
    }
    
    if ($len1 == 0)
        return $len2;    
    if ($len2 == 0)
        return $len1;
    
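    // Two-row dynamic programming over the edit-distance matrix:
    // $v0 holds the previous row, $v1 the row currently being filled.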
    $v0 = range(0, $len1);
    $v1 = array();
    
    for ($i = 1; $i <= $len2; $i++) {
        $v1[0] = $i;
        $str2j = mb_substr($str2, $i - 1, 1);
        
        for ($j = 1; $j <= $len1; $j++) {
            $cost = (mb_substr($str1, $j - 1, 1) == $str2j) ? 0 : 1;
            
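            // cheapest of the three classic edits: delete, insert, substitute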
            $m_min = $v0[$j] + 1;
            $b = $v1[$j - 1] + 1;
            $c = $v0[$j - 1] + $cost;
            
            if ($b < $m_min)
                $m_min = $b;
            if ($c < $m_min)
                $m_min = $c;
            
            $v1[$j] = $m_min;
        }
        
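        // rotate the rows: the row just filled becomes the previous row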
        $vTmp = $v0;
        $v0 = $v1;
        $v1 = $vTmp;
    }
    
    return $v0[$len1];
}

I’ve come up with a solution using this function that’s polling millions of rows of full text against millions of other rows, now due to finish in right around 40 days on a one-minute cron, and it’s running great at this moment. Thanks again!

Sounds good. Thanks for coming back to this; I appreciate it :slight_smile: