How does suggestive search account for misspelled words?

I’ve seen this when searching on google. Even if a person misspells arthritis, for example, and types in arthritis instead, Google figured out what the person means. I need to implement something similar on my website. Roughly how is this accomplished in the database? Do websites create a common search table in a database that includes a column of mis-spelled variations?

Thanks!

If and only if your platform is Linux that uses PHP and MySql or similar databases…

Try searching for sounds like or maybe soundex.

AFAIK, most searches do not depend on adding fields with misspellings, but reduce words into their components. i.e. “stemmer”, “lexeme”, “phoneme”.

I just tried some MySQL FULLTEXT queries, but although they found exact matches very well, disappointingly they did not return misspelled words. However, as John_Betong posted, soundex is worth a try.

IMHO, PostgreSQL (and probably other databases as well) has more advanced text search features

1 Like

This is just my thinking but wouldn’t google do something more clever than just looking at similar words. At a guess i would have thought they would look at what the user enters, what they ultimately click on and how this compares to other searches. E.g If 200,000 people type is arthritis and click on a webpage that contains that word surely it would make sense that the 30 people who type arthitis accidentally and still click on a webpage about arthritis or realise and retype the word correctly and search again and then click on a webpage about arthritis actually wanted arthritis as the search term. Over time and millions of interactions google could build a very powerful system to work out what someone actually means and offer the ‘did you mean XXX’. Google bombing is similar in some respects where enough links with certain keywords pointed to a certain webpage and eventually google assumed that is what people wanted to see https://en.wikipedia.org/wiki/Google_bomb.

I don’t think this would be easy to replicate on a single site. But i could be entirely off in my thinking.

Have you had a chance to experiment with SOUNDEX yet? It has limitations, but has some potential too. eg.

mysql> SELECT SOUNDEX('transit') AS `root`
    ->  , SOUNDEX('transition') AS `verb`
    ->  , SOUNDEX('transitory') AS `adjective`
    ->  , SOUNDEX('transet') AS `vowel`
    ->  , SOUNDEX('trrannsitt') AS `consonants`
    ->  , SOUNDEX('dransit') AS `first`
    ->  , SOUNDEX('trnst') AS `novowels`
    ->  , SOUNDEX('atransit') AS `xvowel`
    ->  , SOUNDEX('you') AS `notme`
    ->  , SOUNDEX('yew') AS `tree`
    ->  , SOUNDEX('ewe') AS `bah`;
+-------+--------+-----------+-------+------------+-------+----------+--------+-------+------+------+
| root  | verb   | adjective | vowel | consonants | first | novowels | xvowel | notme | tree | bah  |
+-------+--------+-----------+-------+------------+-------+----------+--------+-------+------+------+
| T6523 | T65235 | T65236    | T6523 | T6523      | D6523 | T6523    | A36523 | Y000  | Y000 | E000 |
+-------+--------+-----------+-------+------------+-------+----------+--------+-------+------+------+

Notice that the value is the first letter of the word, subsequent vowels have no significance, and the subsequent consonants are represented as digits.

I think in combination with SUBSTRING a very crude imperfect “misspelling search” could be put together. Maybe use with PHP levenshtein? Though TBH I really think a better approach would be to use a better tool for the job (eg. PostgreSQL) instead of pushing the limits of a database not best suited for such work.

This topic was automatically closed 91 days after the last reply. New replies are no longer allowed.