Search - Casting a wider net with wildcards and metaphone

Having found a solution to the character encoding problem

time to get to the other problems

As was previously mentioned the goal is to find a good balance between true / false positives / negatives

Thanks to using the Transliteration class, text input (with extra whitespace removed for good measure) that matched the names in the database would be returned in the search results.

But what about misspelled names, i.e. if one or more extra or missing or wrong characters were entered?

Having looked at John’s code

the use of metaphone() looked very promising as a way to get results that were “close”.

The metaphone function reduces words to their “sounds”. It is somewhat similar to soundex() and it may be due to my lack of knowledge using it, but experiments with soundex() were less than hoped for.
I also experimented with grapheme()- which reduces words based on spelling rather than sounds - but was again not happy with the results.
(I surmise that soundex() and grapheme() are faster at searching for exact matches than matching string characters would be.)

As much over-simplified descriptions, the soundex function reduces words to a letter and 3 numbers that represent sounds, the metaphone function reduces words to the leading vowel, if the word starts with one, followed by consonant sounds, ignoring any vowels within the word.
One problem with metaphone being that it also ignores some characters, the Transliteration class to the rescue once again eg.

Đà NẾng NNK Da Nang TNNK

That is, if misspelled “De Ning” was entered - “TNNK” - it would not match the “Đà Nẵng” metaphone “NNK” in the database.

I ran some ALTER TABLE queries to change the table schemas eg.

`cities`
	(
	`name` VARCHAR(100) NOT NULL
	, `state_name` VARCHAR(100) NOT NULL
	, `phoneme` VARCHAR(50) NOT NULL
	, `transliteration` VARCHAR(100) NOT NULL
	, `transliteration_phoneme` VARCHAR(50) NOT NULL
	, PRIMARY KEY (`name`, `state_name`)
	);

and populated the newly added fields.

Another problem, when partial words are entered as input.
John’s code also takes advantage of MySQLs wildcards that can be used in LIKE conditions. eg.
If “valley” was entered, the code changes it to "%valley% and will match “west valley”, “valley city” and “west valley city”. The “%” is MySQLs wildcard for “nothing or anything”
This is a big improvement over not returning anything, but may result in false positives.

Similarly, changing to “%v%a%l%l%e%y%” will match anything with that sequence of characters, with no or any other characters between them.
Again more fields may match, including more false positives.

Many thanks for the updated city table, if other users have a table of data to search then please forward in a PM and I will endeavort to update the search routine.

***Online Demo with source code ***

Off topic
I have used PDO and some PHP7 Strict mode and apologise in advance for any PHP 5.X.X users. I need to create another topic to ask how to handle the differerent PHP versions.

Update
I had a brief search and discovered both Google and Bing introduced a “Live Search” a few years ago and surprised they do not offer the option in their latest browsers?

Hmmm. AFAIK it is not a browser thing, but happens on the Google search home page.

At least for me, when I start typing the search phrase the input “auto suggests” and the page shows results based on what’s in the input at the time. Sometimes it simply stops when I’m done, other times I need to “enter to search”

Maybe it’s a “can I use” type of feature that depends on if it’s supported? IIRC it was rolled out in stages, so maybe there are some areas where it hasn’t been rolled out to?

Using Intl Transliterator functions solved the character set problems.
And using wildcards and metaphone helped a lot with finding matching fields.
MySQL LIKE is case-insensitive, so no need for the input to have the same case as the field in the database.

All fairly good so far in terms of reducing false negatives. As long as the input text is reasonably close, wrong consonants being an exception, the chances of finding a field in the database are good, sometimes too good.

For example, if Tehran is entered, what are the chances that Terni, Trani and Turin will be relevant?
If albeit short “man” was entered, how many of these might have been what was being search for?
Mainz, Mannheim, Mülheim an der Ruhr, Giugliano in Campania, Manfredonia, Marano di Napoli, Milan, Montesilvano, Cambuslang, Dumbarton, Kilmarnock, Newton Mearns, Madison, Manchester, McAllen, Miami Gardens, Midland, Norman, Pompano Beach, ‘Amran

Both the levenshtein and similar_text functions seem made for the job of removing false positves from the returned results.
In a way, levensthein measures how much is “wrong” and similar_text measures how much is “right”.

Unfortunately, AFAIK they do not have “multi-byte safe” equivalents. Once again, Transliterator to the rescue.
They also treat differences of case as “wrong”, mb_strtolower works well at solving this problem.
A peculiarity of similar_text is that “the pointer” moves down the string, so what is being compared to what can be different, hence the “flips” in the code below.

Now to where I am currently stuck.
The results of both levenshtein and similar_text vary depending on string length and I’m having a hard time coming up with anything that isn’t an arbitrary compromise.

For example,
Man → Manfredonia might very well be a “true” match.
But because the entered text was short, levenshtein is 8 (relatively high) and similar_text is 3 - 42% (relatively low).
Man → McAllen is likely a “false” match.
levenshtein is 5 (a less “wrong” value) and similar_text is 2 - 40% (less “true”, but not by much).

I share my test page here, such as it is, and welcome any suggestions

<?php
declare(strict_types=1);

error_reporting(E_ALL);
ini_set('display_errors', 'true');

/* custom functions */
function return_clean_string(string $string): string {
  $clean_string = str_replace("  ", " ", trim($string));
  return $clean_string;
}
function return_iconv(string $string): string {
  $converted = iconv("UTF-8", "ASCII//TRANSLIT//IGNORE", $string);
  return $converted;
}
function return_mb_convert_encoding(string $string): string {
  $converted = mb_convert_encoding($string, "ASCII");
  return $converted;
}
function return_transliterated(string $string): string {
  $transliterator = transliterator_create("Latin-ASCII");
  $normalized = normalizer_normalize($string);
  $transliterated_string = transliterator_transliterate($transliterator, $normalized);
  return $transliterated_string;
}

/* define variables */
$first_string = "MuḊāfaz̧at al Ḩudaydah";
$second_string = "Östergötland";
	
$first_iconv = "";
$second_iconv = "";
$first_mb_convert_encoding = "";
$second_mb_convert_encoding = "";
$first_transliterated = "";
$second_transliterated = "";

$first_grapheme_extract = "";
$second_grapheme_extract = "";
$first_soundex = "";
$second_soundex = "";
$first_transliterated_soundex = "";
$second_transliterated_soundex = "";
$first_metaphone = "";
$second_metaphone = "";
$first_transliterated_metaphone = "";
$second_transliterated_metaphone = "";

$levenshtein = "";
$levenshtein_flip = "";
$similar_text = "";
$similar_text_flip = "";
$percent = "";
$percent_flip = "";

if (isset($_POST['first_string']) && isset($_POST['second_string']) ) {
  $first_string = return_clean_string($_POST['first_string']);
  $second_string = return_clean_string($_POST['second_string']);

  $first_iconv = return_iconv($first_string);
  $second_iconv = return_iconv($second_string);
  $first_mb_convert_encoding = return_mb_convert_encoding($first_string);
  $second_mb_convert_encoding = return_mb_convert_encoding($second_string);
  if (class_exists('Transliterator')) {
    $first_transliterated = return_transliterated($first_string);
    $second_transliterated = return_transliterated($second_string);

    $first_grapheme_extract = grapheme_extract($first_string, 100);
    $second_grapheme_extract = grapheme_extract($second_string, 100);
  } else {
    $first_transliterated = "Intl is not enabled";
    $second_transliterated = "Intl is not enabled";

    $first_grapheme_extract = "Intl is not enabled";
    $second_grapheme_extract = "Intl is not enabled";
  }
  $first_soundex = soundex($first_string);
  $second_soundex = soundex($second_string);
  if (class_exists('Transliterator')) {
    $first_transliterated_soundex = soundex($first_transliterated);
    $second_transliterated_soundex = soundex($second_transliterated);
  } else {
    $first_transliterated_soundex = "Intl is not enabled";
    $second_transliterated_soundex = "Intl is not enabled";
  }
  $first_metaphone = metaphone($first_string);
  $second_metaphone = metaphone($second_string);
  if (class_exists('Transliterator')) {
    $first_transliterated_metaphone = metaphone($first_transliterated);
    $second_transliterated_metaphone = metaphone($second_transliterated);
  } else {
    $first_transliterated_metaphone = "Intl is not enabled";
    $second_transliterated_metaphone = "Intl is not enabled";
  }

  $levenshtein = levenshtein($first_string, $second_string);
  $levenshtein_flip = levenshtein($second_string, $first_string);
  $similar_text = similar_text($first_string, $second_string, $percent);
  $similar_text_flip = similar_text($second_string, $first_string, $percent_flip);
}
?>
<html>
<head><title>Testing</title>
<style>
.st-t {
  display: table-row;
}
.st-tc {
  display: table-cell;
  padding-left: 0.5em;
}
</style>
</head>
<body>
<h1>Testing</h1>
<form action="#" method="POST">
<input id="first_string" name="first_string" type="text" value="<?php echo $first_string; ?>" />
<br />
<input id="second_string" name="second_string" type="text" value="<?php echo $second_string; ?>" />
<br />
<input type="submit">
</form>
<hr /><!-- hr -->
<div><?php echo "first_strlen " .  strlen($first_string); ?></div>
<div><?php echo "first_mb_strlen " . mb_strlen($first_string); ?></div>
<br />
<div><?php echo "second_strlen " .  strlen($second_string); ?></div>
<div><?php echo "second_mb_strlen " . mb_strlen($second_string); ?></div>
<br />
<div><?php echo "first_strtolower " .  strtolower($first_string); ?></div>
<div><?php echo "first_mb_strtolower " . mb_strtolower($first_string); ?></div>
<br />
<div><?php echo "second_strtolower " .  strtolower($second_string); ?></div>
<div><?php echo "second_mb_strtolower " . mb_strtolower($second_string); ?></div>
<hr /><!-- hr -->
<div><?php  echo "first_iconv " . $first_iconv; ?></div>
<div><?php  echo "second_iconv " . $second_iconv; ?></div>
<br />
<div><?php  echo "first_mb_convert_encoding " . $first_mb_convert_encoding; ?></div>
<div><?php  echo "second_mb_convert_encoding " . $second_mb_convert_encoding; ?></div>
<br />
<div><?php echo "first_transliterated " . $first_transliterated; ?></div>
<div><?php echo "second_transliterated " . $second_transliterated; ?></div>
<hr /><!-- hr -->
<div><?php  echo "first_grapheme_extract " . $first_grapheme_extract; ?></div>
<div><?php  echo "second_grapheme_extract " . $second_grapheme_extract; ?></div>
<br />
<div><?php  echo "first_soundex " . $first_soundex; ?></div>
<div><?php  echo "first_transliterated_soundex " . $first_transliterated_soundex; ?></div>
<div><?php  echo "second_soundex " . $second_soundex; ?></div>
<div><?php  echo "second_transliterated_soundex " . $second_transliterated_soundex; ?></div>
<br />
<div><?php echo "first_metaphone " . $first_metaphone; ?></div>
<div><?php echo "first_transliterated_metaphone " . $first_transliterated_metaphone; ?></div>
<div><?php echo "second_metaphone " . $second_metaphone; ?></div>
<div><?php echo "second_transliterated_metaphone " . $second_transliterated_metaphone; ?></div>
<hr /><!-- hr -->
<div><?php echo "levenshtein " . $levenshtein; ?></div>
<div><?php echo "levenshtein_flip " . $levenshtein_flip; ?></div>
<br />
<div class="st-t"><span><?php echo "similar_text </span><span class=\"st-tc\">" . $similar_text . "</span><span class=\"st-tc\">" . $percent . " %"; ?></span></div>
<div class="st-t"><span><?php echo "similar_text_flip </span><span class=\"st-tc\">" . $similar_text_flip . "</span><span class=\"st-tc\">" . $percent_flip . " %"; ?></span></div>
<hr /><!-- hr -->
</body></html>

The search routine uses LIKE and input text syllables are separated by spaces.

A text search for valley would have a single parameter %valley% and not %v%a%l%l%e%y%

A text search for valley uta would have two parameters %valley%%uta and not %v%a%l%l%e%y%%u%%t%%a%

I have updated the display to show the sql and the parameters:

$sql:   SELECT * FROM cities2 WHERE concat(name,state_name) LIKE ?  AND concat(name,state_name) LIKE ? 
Params:   %vall%%uta%

Sorry, I must have left out what I tried.

function return_wildcarded_loose(string $string): string {
  $string_array = str_split($string);
  $wildcarded_str = "%" . implode("%", $string_array) . "%";
  return $wildcarded_str;
}

It works great for when the name is missing a letter. eg.
“Teas” will match “Texas” - but it also matches a lot of other names too, many, if not most, of which are false positives. The problem is more severe the shorter the input string is.

I seldom use Google Chrome and prefer FireFox, Opera and Vivaldi.

My default Google Chrome browser is https://www.google.co.th/ and set to use English language. This feature does not work… but delighted to say that it does work with https://www.google.com/

The search results display are very slow, no doubt to to the limitation of the service provider :frowning:

Edit:
I forgot to mention that I really like your demo and curious to know how you managed to capture the animation?

I have uploaded your Mittineague’s Test Page with slight modifications but unfortunately the input text fields are not set to search the database :frowning:

That is what I am using to compare one input to another to experiment with different functions - before editing the actual search form file that does use the database.

I find it easier to edit a smaller file than I do with one that is more complex. I find it easier to get a feel for things when they’re more isolated from other code that might be affecting them and less work to remove something that doesn’t work that way.

Once I get something I’m fairly happy with I’ll post it here. But right now I’m still foggy on how to use levenshtein and similar_text to advantage.

1 Like

I got side-tracked with getting into some redesign stuff for a while.
And it took me a while to figure a few things out with this script.

I zipped everything up so all that should be needed is

  • install and enable Intl if it isn’t already
  • create an empty database to hold the data
  • import the sql file
  • unzip the folders / files to your localhost
  • edit the config’s db cnx file

I cleaned some files up a bit, some are still very messy, but here it is

search.zip (103.2 KB)

I downloaded and unpacked the zip file onto my LocalHost (running PHP 7) and it looks good. I have yet to install the the “International Functions” and also the database.

I uploaded the script to my site but unfortunately I recently had MySql problems which I could only resolve by installing an old (Php 5.5) backup and having big problems trying to get up to date and install PHP 7, MySql, etc Consequently your demo is not working online. I hope to get PHP 7.0 installed soon.

Many thanks for the Softonic link, unfortunately is for Windows only :frowning:

This topic was automatically closed 91 days after the last reply. New replies are no longer allowed.