PHP Function of the Week - similar text

[CENTER]Array of PHP functions
shuffle and pop
tell us the function
we’ll learn this stop


So, another function I almost never use - but not without reason, I prefer the function in PHP with the funniest (to me) name - [fphp]levenshtein[/fphp]. The use case for both the functions is much the same, though the levenshtein algorithm is more efficient. @ScallioXTX will be along later in the week to explain why when he has some free time - he’s better at that sort of thing than I am.

The use case for either function is a bit limited. I use it in an Ajax call up to create a drop down list of possible users for the admin interface.

While researching this function I found a couple of references on how to embed levenshtein into MySQL as a user defined function. Under some circumstances that might prove to be more powerful than performing the match operation in PHP.

Basically the difference is that levenshtein returns the sum of the number of characters that need to be 1/ added, 2/ replaced or 3/ removed to get from one string to another. So for example the difference between ScallioXTX and Scylio is 5. 3 characters removed (XTX), 1 replaced (y instead of a), and 1 removed (1 of the Ls) = 5.

similar_text however returns how many of the characters that are in the first string are also in the second string, according to the manual, but I can’t quite reproduce this.

For example

$alpha = 'abcdefghijklmnopqrstuvwxyz';
$st = similar_text(strrev($alpha), $alpha, $perc);
var_dump($st);   // 1
var_dump($perc); // 3.8461538461538

php seems to think those strings have 1 character in common, while in fact they have no characters in common whatsoever…

Also, similar_text($a, $b) != similar_text($b, $a)

$str1 = 'PHP IS GREAT';
$str2 = 'WITH MYSQLAS';

$st = similar_text($str1, $str2, $perc);
var_dump($st);   // 4
var_dump($perc); // 33.333333333333

$st = similar_text($str2, $str1, $perc);
var_dump($st);   // 2
var_dump($perc); // 16.666666666667

Which seems weird to me. The number of characters they have in common should not be different when you swap the parameters IMO, especially when the strings have the same length and the difference can not be attributed to that.

Beside that similar_text is a lot slower than levenhstein (O(max(n,m)^3 vs O(n*m) – where n is the length of the first string and m is the length of the second string).

Long story short, unless you have a very specific usecase where you are sure that similar_text does exactly what you need, it’s probably advisable to go with levenhstein instead, which is a faster and more predictable.

Thanks Scallio

There is a note in the manual which says:

So if 0 is reserved for an empty string, perhaps 1 is somehow indicative of there being for no matches.

But if you stick a 1 in between m and n, which the reverse it compared it pivots on that integer, and of course returns 1 as expected.

An oddity I guess.