Using Intl Transliterator functions solved the character set problems.
And using wildcards and metaphone helped a lot with finding matching fields.
MySQL LIKE is case-insensitive with the default (case-insensitive, `_ci`) collations, so the input does not need to have the same case as the field in the database.
All fairly good so far in terms of reducing false negatives. As long as the input text is reasonably close, wrong consonants being an exception, the chances of finding a field in the database are good, sometimes too good.
For example, if Tehran is entered, what are the chances that Terni, Trani and Turin will be relevant?
If the admittedly short “man” was entered, how many of these might have been what was being searched for?
Mainz, Mannheim, Mülheim an der Ruhr, Giugliano in Campania, Manfredonia, Marano di Napoli, Milan, Montesilvano, Cambuslang, Dumbarton, Kilmarnock, Newton Mearns, Madison, Manchester, McAllen, Miami Gardens, Midland, Norman, Pompano Beach, ‘Amran
Both the levenshtein and similar_text functions seem made for the job of removing false positives from the returned results.
In a way, levenshtein measures how much is “wrong” and similar_text measures how much is “right”.
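To make the “wrong” versus “right” distinction concrete, here is a minimal sketch that scores the Tehran example against a few candidates. The candidate list is hard-coded for illustration only; on a real page it would come from the database query.

```php
<?php
declare(strict_types=1);

// Illustrative input and candidates (hard-coded, not read from a database).
$input = 'Tehran';
$candidates = ['Tehran', 'Terni', 'Trani', 'Turin'];

foreach ($candidates as $candidate) {
    // levenshtein(): minimum number of single-byte insertions, deletions
    // and replacements needed to turn one string into the other.
    $distance = levenshtein($input, $candidate);

    // similar_text(): number of matching bytes; the percentage is written
    // into $percent, which is passed by reference.
    $matching = similar_text($input, $candidate, $percent);

    printf(
        "%-8s levenshtein=%d similar_text=%d (%.1f%%)\n",
        $candidate,
        $distance,
        $matching,
        $percent
    );
}
```

A lower levenshtein value and a higher similar_text value both point towards a better match, so any cut-off has to treat the two in opposite directions.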
Unfortunately, AFAIK neither has a “multi-byte safe” equivalent, so both compare bytes rather than characters. Once again, Transliterator to the rescue.
They also treat differences of case as “wrong”; mb_strtolower works well at solving this problem.
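As a rough sketch of that clean-up step, the helper below folds a string to lowercase ASCII before it is compared. The name `normalize_for_compare` is made up for this example, and the code assumes mbstring is loaded; Intl is only used when it is available.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of the test page): fold a UTF-8 string to
 * lowercase ASCII before handing it to levenshtein() or similar_text().
 * Assumes mbstring is loaded; Intl is used only when available.
 */
function normalize_for_compare(string $string): string {
    if (function_exists('transliterator_transliterate')) {
        // 'Latin-ASCII' is the same transform ID the test page uses.
        $ascii = transliterator_transliterate('Latin-ASCII', $string);
        if ($ascii !== false) {
            $string = $ascii;
        }
    }
    return mb_strtolower($string, 'UTF-8');
}

// "ü" (written here as \u{FC}) is two bytes in UTF-8, so the raw
// byte-wise distance to "Mulheim" is 2, not 1.
echo levenshtein("M\u{FC}lheim", 'Mulheim'), "\n";

// After folding, only the four missing letters count against "McAllen".
echo levenshtein(
    normalize_for_compare('Man'),
    normalize_for_compare('McAllen')
), "\n";
```

Without the case folding, the “a” in Man and the “A” in McAllen count as different bytes, which is why the unnormalized distance for that pair is one higher.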
A peculiarity of similar_text is that “the pointer” moves down the string, so what is being compared to what can differ with the order of the arguments, hence the “flips” in the code below.
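The order dependence is easy to reproduce with the pair of strings used in the PHP manual. The sketch below also shows one possible way to deal with it, a made-up helper called `similar_text_best` that simply keeps the better of the two directions; whether “better of the two” is the right rule is an assumption, not something I have settled.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: run similar_text() in both directions and keep the
 * higher score, so the result does not depend on argument order.
 * Returns [matching_bytes, percent].
 */
function similar_text_best(string $a, string $b): array {
    $forward = similar_text($a, $b, $forward_percent);
    $flipped = similar_text($b, $a, $flipped_percent);

    return $forward >= $flipped
        ? [$forward, $forward_percent]
        : [$flipped, $flipped_percent];
}

// Argument order matters: these two calls return 5 and 3 respectively.
echo similar_text('bafoobar', 'barfoomo'), "\n";
echo similar_text('barfoomo', 'bafoobar'), "\n";

// The helper reports the higher of the two, whichever order is passed in.
[$matching, $percent] = similar_text_best('barfoomo', 'bafoobar');
echo $matching, ' (', $percent, " %)\n";
```

The difference comes from which longest common substring is found first: “foo” in one direction, “bar” in the other, and only the text to the left and right of that first hit is compared afterwards.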
Now to where I am currently stuck.
The results of both levenshtein and similar_text vary depending on string length and I’m having a hard time coming up with anything that isn’t an arbitrary compromise.
For example,
Man – Manfredonia might very well be a “true” match.
But because the entered text was short, levenshtein is 8 (relatively high) and similar_text is 3, or about 43% (relatively low).
Man – McAllen is likely a “false” match.
Yet levenshtein is 5 (a less “wrong” value) and similar_text is 2, or 40% (less “true”, but not by much).
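The numbers above can be reproduced with the sketch below. It also tries the most obvious normalization, dividing the distance by the length of the longer string; `levenshtein_ratio` is a name invented for this example, and the strings are compared as typed (no lowercasing) so that the results line up with the figures quoted.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: express levenshtein() as a fraction of the longer
 * string's byte length, from 0.0 (identical) to 1.0 (nothing in common).
 */
function levenshtein_ratio(string $a, string $b): float {
    $longest = max(strlen($a), strlen($b));

    return $longest === 0 ? 0.0 : levenshtein($a, $b) / $longest;
}

foreach (['Manfredonia', 'McAllen'] as $candidate) {
    // similar_text() derives its percentage as
    // matching * 2 / (strlen(first) + strlen(second)) * 100.
    $matching = similar_text('Man', $candidate, $percent);

    printf(
        "Man / %-11s levenshtein=%d ratio=%.2f similar_text=%d (%.1f%%)\n",
        $candidate,
        levenshtein('Man', $candidate),
        levenshtein_ratio('Man', $candidate),
        $matching,
        $percent
    );
}
```

The two ratios work out to 8/11 and 5/7, roughly 0.73 and 0.71, so dividing by length on its own does not separate the likely match from the unlikely one either.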
I share my test page here, such as it is, and welcome any suggestions.
<?php
declare(strict_types=1);
error_reporting(E_ALL);
ini_set('display_errors', 'true');
/* custom functions */
function return_clean_string(string $string): string {
// Trim, then collapse runs of whitespace into a single space.
$clean_string = preg_replace('/\s+/u', ' ', trim($string)) ?? trim($string);
return $clean_string;
}
function return_iconv(string $string): string {
$converted = iconv("UTF-8", "ASCII//TRANSLIT//IGNORE", $string);
return $converted;
}
function return_mb_convert_encoding(string $string): string {
$converted = mb_convert_encoding($string, "ASCII");
return $converted;
}
function return_transliterated(string $string): string {
$transliterator = transliterator_create("Latin-ASCII");
$normalized = normalizer_normalize($string);
$transliterated_string = transliterator_transliterate($transliterator, $normalized);
return $transliterated_string;
}
/* define variables */
$first_string = "Muḩāfaz̧at al Ḩudaydah";
$second_string = "Östergötland";
$first_iconv = "";
$second_iconv = "";
$first_mb_convert_encoding = "";
$second_mb_convert_encoding = "";
$first_transliterated = "";
$second_transliterated = "";
$first_grapheme_extract = "";
$second_grapheme_extract = "";
$first_soundex = "";
$second_soundex = "";
$first_transliterated_soundex = "";
$second_transliterated_soundex = "";
$first_metaphone = "";
$second_metaphone = "";
$first_transliterated_metaphone = "";
$second_transliterated_metaphone = "";
$levenshtein = "";
$levenshtein_flip = "";
$similar_text = "";
$similar_text_flip = "";
$percent = "";
$percent_flip = "";
if (isset($_POST['first_string']) && isset($_POST['second_string']) ) {
$first_string = return_clean_string($_POST['first_string']);
$second_string = return_clean_string($_POST['second_string']);
$first_iconv = return_iconv($first_string);
$second_iconv = return_iconv($second_string);
$first_mb_convert_encoding = return_mb_convert_encoding($first_string);
$second_mb_convert_encoding = return_mb_convert_encoding($second_string);
if (class_exists('Transliterator')) {
$first_transliterated = return_transliterated($first_string);
$second_transliterated = return_transliterated($second_string);
$first_grapheme_extract = grapheme_extract($first_string, 100);
$second_grapheme_extract = grapheme_extract($second_string, 100);
} else {
$first_transliterated = "Intl is not enabled";
$second_transliterated = "Intl is not enabled";
$first_grapheme_extract = "Intl is not enabled";
$second_grapheme_extract = "Intl is not enabled";
}
$first_soundex = soundex($first_string);
$second_soundex = soundex($second_string);
if (class_exists('Transliterator')) {
$first_transliterated_soundex = soundex($first_transliterated);
$second_transliterated_soundex = soundex($second_transliterated);
} else {
$first_transliterated_soundex = "Intl is not enabled";
$second_transliterated_soundex = "Intl is not enabled";
}
$first_metaphone = metaphone($first_string);
$second_metaphone = metaphone($second_string);
if (class_exists('Transliterator')) {
$first_transliterated_metaphone = metaphone($first_transliterated);
$second_transliterated_metaphone = metaphone($second_transliterated);
} else {
$first_transliterated_metaphone = "Intl is not enabled";
$second_transliterated_metaphone = "Intl is not enabled";
}
$levenshtein = levenshtein($first_string, $second_string);
$levenshtein_flip = levenshtein($second_string, $first_string);
$similar_text = similar_text($first_string, $second_string, $percent);
$similar_text_flip = similar_text($second_string, $first_string, $percent_flip);
}
?>
<!DOCTYPE html>
<html>
<head><meta charset="utf-8"><title>Testing</title>
<style>
.st-t {
display: table-row;
}
.st-tc {
display: table-cell;
padding-left: 0.5em;
}
</style>
</head>
<body>
<h1>Testing</h1>
<form action="#" method="POST">
<input id="first_string" name="first_string" type="text" value="<?php echo htmlspecialchars($first_string, ENT_QUOTES, 'UTF-8'); ?>" />
<br />
<input id="second_string" name="second_string" type="text" value="<?php echo htmlspecialchars($second_string, ENT_QUOTES, 'UTF-8'); ?>" />
<br />
<input type="submit">
</form>
<hr /><!-- hr -->
<div><?php echo "first_strlen " . strlen($first_string); ?></div>
<div><?php echo "first_mb_strlen " . mb_strlen($first_string); ?></div>
<br />
<div><?php echo "second_strlen " . strlen($second_string); ?></div>
<div><?php echo "second_mb_strlen " . mb_strlen($second_string); ?></div>
<br />
<div><?php echo "first_strtolower " . strtolower($first_string); ?></div>
<div><?php echo "first_mb_strtolower " . mb_strtolower($first_string); ?></div>
<br />
<div><?php echo "second_strtolower " . strtolower($second_string); ?></div>
<div><?php echo "second_mb_strtolower " . mb_strtolower($second_string); ?></div>
<hr /><!-- hr -->
<div><?php echo "first_iconv " . $first_iconv; ?></div>
<div><?php echo "second_iconv " . $second_iconv; ?></div>
<br />
<div><?php echo "first_mb_convert_encoding " . $first_mb_convert_encoding; ?></div>
<div><?php echo "second_mb_convert_encoding " . $second_mb_convert_encoding; ?></div>
<br />
<div><?php echo "first_transliterated " . $first_transliterated; ?></div>
<div><?php echo "second_transliterated " . $second_transliterated; ?></div>
<hr /><!-- hr -->
<div><?php echo "first_grapheme_extract " . $first_grapheme_extract; ?></div>
<div><?php echo "second_grapheme_extract " . $second_grapheme_extract; ?></div>
<br />
<div><?php echo "first_soundex " . $first_soundex; ?></div>
<div><?php echo "first_transliterated_soundex " . $first_transliterated_soundex; ?></div>
<div><?php echo "second_soundex " . $second_soundex; ?></div>
<div><?php echo "second_transliterated_soundex " . $second_transliterated_soundex; ?></div>
<br />
<div><?php echo "first_metaphone " . $first_metaphone; ?></div>
<div><?php echo "first_transliterated_metaphone " . $first_transliterated_metaphone; ?></div>
<div><?php echo "second_metaphone " . $second_metaphone; ?></div>
<div><?php echo "second_transliterated_metaphone " . $second_transliterated_metaphone; ?></div>
<hr /><!-- hr -->
<div><?php echo "levenshtein " . $levenshtein; ?></div>
<div><?php echo "levenshtein_flip " . $levenshtein_flip; ?></div>
<br />
<div class="st-t"><span><?php echo "similar_text </span><span class=\"st-tc\">" . $similar_text . "</span><span class=\"st-tc\">" . $percent . " %"; ?></span></div>
<div class="st-t"><span><?php echo "similar_text_flip </span><span class=\"st-tc\">" . $similar_text_flip . "</span><span class=\"st-tc\">" . $percent_flip . " %"; ?></span></div>
<hr /><!-- hr -->
</body></html>