SitePoint Sponsor

User Tag List

Results 1 to 15 of 15
  1. #1
    SitePoint Addict KJedi's Avatar
    Join Date
    Sep 2005
    Location
    Ukraine, Nikolaev
    Posts
    231
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    Fuzzy search algorithm

    In the style of my previous post: http://www.sitepoint.com/forums/showthread.php?t=508740

    I have to perform a fuzzy search for the game titles.
    If user enters Lra Craft, I should show him:
    • Lara Croft
    • Lara Croft 2
    • Lara Croft III - Tomb Raider
    Highlighting the first one a s best match

    Yes, I can take levenshtein difference (built-in php function) or something like that, but it will work too long and I need a really fast solution.

    What I was thinking of, is placing the titles comparison logics to the MySQL 5 stored procedure, that should perform faster, but I need the full algorithm described in this case.

    Or maybe there are other good solutions... I can't speed up comparison by assuming first letters are spelled correctly, user can enter ara Croft, but should get the correct match with Lara Croft.

    P.S. If the problem described in the previous post could be solved with the same approach as this one and programmed as one class/stored procedure(s), that would be really ideal.

    Any comments and ideas are appreciated!
    Thanks beforehand!

  2. #2
    SitePoint Wizard silver trophybronze trophy Cups's Avatar
    Join Date
    Oct 2006
    Location
    France, deep rural.
    Posts
    6,869
    Mentioned
    17 Post(s)
    Tagged
    1 Thread(s)
    You don't mention this so here goes, have you investigated using mysql's FULLTEXT search?

    If you have, how did it let you down?

  3. #3
    SitePoint Addict KJedi's Avatar
    Join Date
    Sep 2005
    Location
    Ukraine, Nikolaev
    Posts
    231
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Does it work well for MISSPELLED words???

  4. #4
    An average geek earl-grey's Avatar
    Join Date
    Mar 2005
    Location
    Ukraine
    Posts
    1,403
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    You can create a dictionary table in database, which maps common misspellings to the correct words.

    When doing a search, you convert each misspelled word into a correct one, and perform a fulltext search.

  5. #5
    SitePoint Addict KJedi's Avatar
    Join Date
    Sep 2005
    Location
    Ukraine, Nikolaev
    Posts
    231
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    And if user enters an uncommon misspelling? I need real non-precise search, not search by dictionary... But if only there is a way to combine that variants. Consider first of all I search by dictionary (if that is faster), if none match found, then perform "fuzzy" search and remember this misspelling. But here again I need a non-precise search.
    But if using a combined approach, it shouldn't be mega-fast

    Any links or ideas about fuzzy search?

  6. #6
    Resident Code Monkey Chris Corbyn's Avatar
    Join Date
    Nov 2005
    Location
    Melbourne, Australia
    Posts
    713
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    You can probably double FULLTEXT (for non-mispelled words because it's fast!) with SOUNDEX. I forget what the MySQL syntax for soundex is (maybe something like "SOUNDS LIKE".

  7. #7
    Resident Code Monkey Chris Corbyn's Avatar
    Join Date
    Nov 2005
    Location
    Melbourne, Australia
    Posts
    713
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Here we are: http://dev.mysql.com/doc/refman/5.0/...or_sounds-like

    There's actually a pure SOUNDEX() function too.

  8. #8
    SitePoint Addict KJedi's Avatar
    Join Date
    Sep 2005
    Location
    Ukraine, Nikolaev
    Posts
    231
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Well, sounds like maybe a good solution instead of dictionary of misspelled words, but... if I have title in the DB like:
    World Snooker Championship 2007
    And user types:
    Snoker
    This fails.
    Any ideas?

    Well, I don't mind having table with common misspellings, but where to get this data? I don't need the whole dictionary, I need only words, that are met in game titles. And besides common words, there are personal names, which also can be misspelled...

    Hmm... Just got one more idea... What if I build index table with all the words, that are present in game titles. And then perform the following algorithm:
    1. get user string, perform fulltext search
    2. If anything found, that's OK
    3. If nothing found, then split the string into words
    4. Find matches in that index table using soundex
    5. Build all possible strings
    6. Take levenshtein distance from the initial string and all those I got
    7. Rank them in the order of matching
    8. Perform fulltext search one-by-one until I find any results.
    9. Display results and variants of the strings I've got and let user choose the variant he wants

    Or, for speeding up performance, skip the 8th step and simply show variants of his query and results for the string he actually typed.

    How do you find that? Any comments/critique?
    Last edited by KJedi; Oct 24, 2007 at 02:43.

  9. #9
    SitePoint Wizard stereofrog's Avatar
    Join Date
    Apr 2004
    Location
    germany
    Posts
    4,324
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Yes, using the words table is a good idea, I'd suggest you go one step further and use it as an index table:

    Code:
    table documents(id,title)
    table words(id, word)
    table word_index(word_id,doc_id)
    An entry in word_index means that the document doc_id contains the word word_id. Having that, your search algo may look like this

    Code:
    foreach $word in search_string
    	foreach record from "words" that sounds like $word
    		add "words" id to $word_ids
    
    # here we should have a list of search words' ids
    # bailout if it is empty
    
    if search_type is "any word"
    
    	# any words - use one 'union' query
    	
    	select doc_id from index
    		where index.word_id in ($word_ids)
    
    else 
    
    	# all words - use successive 'intersection' queries
    	# it is also possible to build one big join statement instead
    
    	foreach $word_ids as $id
    		$doc_ids[] = select doc_id from index
    			where index.word_id = $id 
    				and index.doc_id in ($doc_ids)

  10. #10
    SitePoint Zealot
    Join Date
    Oct 2007
    Location
    In the blogosphere
    Posts
    108
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Been awhile since I last touched on fuzzy searches but if you're interested in doing a levenshtein as a MySQL stored proc, feel free to drop by my blog :P

    I've posted the code there.
    (Yes, I know this is shameless blog plug, but I need the traffic... )
    bLueFrogX's Blog - Random Ramblings of a NEET Techie ★

  11. #11
    SitePoint Addict KJedi's Avatar
    Join Date
    Sep 2005
    Location
    Ukraine, Nikolaev
    Posts
    231
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by stereofrog View Post
    Yes, using the words table is a good idea, I'd suggest you go one step further and use it as an index table:
    .....
    Interesting, very interesting... Need to think about it. This way I don't use mysql's fulltext search, so I can't search through the word forms... But I can mix all that up
    And perform search in 3 stages.
    1. Simple fulltext (fastest)
    2. Like stereofrog proposed (sufficient)
    3. Building phrase forms like I proposed in my previous post ans searching them using mysql fulltext (quite slow)

    There can be also 0-level search - simple phrase search using %query%. I don't know if it is faster, than mysql's fulltext. Maybe, if I also have simple index on that field in addition to fulltext one.

    2 bluefrogx: I don't think I need this for thisproject any more... We'll see after I implement this search in my CMS. Maybe I'll have time and desire to do this as MySQL stored proc... Not sure

  12. #12
    SitePoint Wizard stereofrog's Avatar
    Join Date
    Apr 2004
    Location
    germany
    Posts
    4,324
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by bluefrogx View Post
    Been awhile since I last touched on fuzzy searches but if you're interested in doing a levenshtein as a MySQL stored proc, feel free to drop by my blog :P

    I've posted the code there.
    (Yes, I know this is shameless blog plug, but I need the traffic... )
    Thanks for the link, btw. I was about to write mysql-levenshtein for one of my projects, now I can take yours. ) Did you do any bechmarking for it, esp. compared to the UDF?

  13. #13
    SitePoint Zealot
    Join Date
    Oct 2007
    Location
    In the blogosphere
    Posts
    108
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    The UDF definitely has better performance compared to a stored proc. Doesn't mean the stored proc isn't useful though, if you're on windows, its about the nly choice you got Tbh, I haven't really done any serious benchmarking because I discovered the UDF shortly after I made the stored proc and switched in favour of it. Lemme know your results (A comment in me blog would be awesome as well).

    You're welcome :P
    bLueFrogX's Blog - Random Ramblings of a NEET Techie ★

  14. #14
    SitePoint Addict KJedi's Avatar
    Join Date
    Sep 2005
    Location
    Ukraine, Nikolaev
    Posts
    231
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    I tried to use your function and it turned out, that:
    Code:
    mysql> select LEVENSHTEIN('knight', 'knihgt');
    +---------------------------------+
    | LEVENSHTEIN('knight', 'knihgt') |
    +---------------------------------+
    |                               4 |
    +---------------------------------+
    1 row in set (0.01 sec)
    And
    Code PHP:
    echo levenshtein('knight', 'knihgt');
    gives
    Code:
    2
    It does not doubles it:
    Code:
    mysql> select LEVENSHTEIN('knight', 'knihgt_');
    +----------------------------------+
    | LEVENSHTEIN('knight', 'knihgt_') |
    +----------------------------------+
    |                                5 |
    +----------------------------------+
    1 row in set (0.01 sec)
    PHP's levenshtein gives 3...

    And how to use that UDF? Maybe it gives correct results...

  15. #15
    SitePoint Member
    Join Date
    Aug 2008
    Posts
    1
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    I realise this is an old thread, but this is where the Google Search brought me.

    Josh Drew published a Levenshtein UDF long ago:

    Code:
    mysql> select levenshtein('knight', 'knihgt');
    +---------------------------------+
    | levenshtein('knight', 'knihgt') |
    +---------------------------------+
    |                               2 | 
    +---------------------------------+
    I've just updated it to do Damerau-Levenshtein (tests for transpositions):
    Code:
    mysql> select damlev('knight', 'knihgt');
    +----------------------------+
    | damlev('knight', 'knihgt') |
    +----------------------------+
    |                          1 | 
    +----------------------------+
    The source code and a couple of performance-hacked versions are on my blog:
    No URLs N00b!
    ...which you'll have to find from the link in my profile's 'Contact Info'...
    If you see it in there, and you think it's useful AND you're not a SitePoint n00b, would you mind posting the URL on this thread? Thanks, that would be marvellous.

    The code is a breeze to compile and install (maybe I got lucky on Slackware), let me know how you get on with it.
    Last edited by Sean4u; Aug 27, 2008 at 13:03. Reason: Where's my sig?


Bookmarks

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •