SitePoint Sponsor

User Tag List

Results 1 to 4 of 4
  1. #1
    I solve practical problems. bronze trophy
    Michael Morris's Avatar
    Join Date
    Jan 2008
    Location
    Knoxville TN
    Posts
    2,023
    Mentioned
    62 Post(s)
    Tagged
    0 Thread(s)

    PHP Function of the Week - similar text

    Array of PHP functions
    shuffle and pop
    tell us the function
    we'll learn this stop


    similar_text


    So, another function I almost never use - but not without reason, I prefer the function in PHP with the funniest (to me) name - levenshtein. The use case for both the functions is much the same, though the levenshtein algorithm is more efficient. @ScallioXTX will be along later in the week to explain why when he has some free time - he's better at that sort of thing than I am.

    The use case for either function is a bit limited. I use it in an Ajax call up to create a drop down list of possible users for the admin interface.

    While researching this function I found a couple of references on how to embed levenshtein into MySQL as a user defined function. Under some circumstances that might prove to be more powerful than performing the match operation in PHP.

  2. #2
    Utopia, Inc. silver trophy
    ScallioXTX's Avatar
    Join Date
    Aug 2008
    Location
    The Netherlands
    Posts
    9,031
    Mentioned
    152 Post(s)
    Tagged
    2 Thread(s)
    Basically the difference is that levenshtein returns the sum of the number of characters that need to be 1/ added, 2/ replaced or 3/ removed to get from one string to another. So for example the difference between ScallioXTX and Scylio is 5. 3 characters removed (XTX), 1 replaced (y instead of a), and 1 removed (1 of the Ls) = 5.

    similar_text however returns how many of the characters that are in the first string are also in the second string, according to the manual, but I can't quite reproduce this.

    For example

    PHP Code:
    <?php
    $alpha 
    'abcdefghijklmnopqrstuvwxyz';
    $st similar_text(strrev($alpha), $alpha$perc);
    var_dump($st);   // 1
    var_dump($perc); // 3.8461538461538
    php seems to think those strings have 1 character in common, while in fact they have no characters in common whatsoever...

    Also, similar_text($a, $b) != similar_text($b, $a)

    PHP Code:
    <?php
    $str1 
    'PHP IS GREAT';
    $str2 'WITH MYSQLAS';

    $st similar_text($str1$str2$perc);
    var_dump($st);   // 4
    var_dump($perc); // 33.333333333333

    $st similar_text($str2$str1$perc);
    var_dump($st);   // 2
    var_dump($perc); // 16.666666666667
    Which seems weird to me. The number of characters they have in common should not be different when you swap the parameters IMO, especially when the strings have the same length and the difference can not be attributed to that.

    Beside that similar_text is a lot slower than levenhstein (O(max(n,m)^3 vs O(n*m) -- where n is the length of the first string and m is the length of the second string).

    Long story short, unless you have a very specific usecase where you are sure that similar_text does exactly what you need, it's probably advisable to go with levenhstein instead, which is a faster and more predictable.
    Rémon - Hosting Advisor

    Minimal Bookmarks Tree
    My Google Chrome extension: browsing bookmarks made easy

  3. #3
    I solve practical problems. bronze trophy
    Michael Morris's Avatar
    Join Date
    Jan 2008
    Location
    Knoxville TN
    Posts
    2,023
    Mentioned
    62 Post(s)
    Tagged
    0 Thread(s)
    Thanks Scallio

  4. #4
    SitePoint Wizard silver trophybronze trophy Cups's Avatar
    Join Date
    Oct 2006
    Location
    France, deep rural.
    Posts
    6,869
    Mentioned
    17 Post(s)
    Tagged
    1 Thread(s)
    Quote Originally Posted by ScallioXTX
    PHP Code:
    <?php
    $alpha 
    'abcdefghijklmnopqrstuvwxyz';
    $st similar_text(strrev($alpha), $alpha$perc);
    var_dump($st);   // 1
    var_dump($perc); // 3.8461538461538
    php seems to think those strings have 1 character in common, while in fact they have no characters in common whatsoever...
    There is a note in the manual which says:

    Quote Originally Posted by manual
    Please note that this function calculates a similarity of 0 (zero) for two empty strings.
    PHP Code:
    <?php
    similar_text
    (""""$sim);
    echo 
    $sim// "0"
    ?>
    So if 0 is reserved for an empty string, perhaps 1 is somehow indicative of there being for no matches.

    But if you stick a 1 in between m and n, which the reverse it compared it pivots on that integer, and of course returns 1 as expected.

    An oddity I guess.


Bookmarks

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •