Continuing the discussion from Is this correct code?:
This discussion caught my interest, and because it has been a while since I worked with iconv(), I gave it another go.
Instead of using a database, I created arrays to use as the source of the strings.
I created $usa_cities_arr from https://simple.wikipedia.org/wiki/List_of_United_States_cities_by_population (272 pairs) and also
$spain_cities_arr = array(
1 => array("Madrid", "Madrid")
,2 => array("Barcelona", "Catalonia")
,3 => array("Valencia", "Valencia")
,4 => array("Seville", "Andalusia")
,5 => array("Zaragoza", "Aragon")
,6 => array("Málaga", "Andalusia")
,7 => array("Murcia", "Murcia")
,8 => array("Palma de Mallorca", "Balearic Islands")
,9 => array("Las Palmas de Gran Canaria", "Canary Islands")
,10 => array("Bilbao", "Basque Country")
);
$germany_cities_arr = array(
1 => array("Berlin", "Berlin")
,2 => array("Hamburg", "Hamburg")
,3 => array("Munich", "Bavaria")
,4 => array("Cologne", "North Rhine-Westphalia")
,5 => array("Frankfurt", "Hesse")
,6 => array("Essen", "North Rhine-Westphalia")
,7 => array("Dortmund", "North Rhine-Westphalia")
,8 => array("Stuttgart", "Baden-Württemberg")
,9 => array("Düsseldorf", "North Rhine-Westphalia")
,10 => array("Bremen", "Bremen")
,11 => array("Hanover", "Lower Saxony")
,12 => array("Duisburg", "North Rhine-Westphalia")
,13 => array("Nuremberg", "Bavaria")
,14 => array("Leipzig", "Saxony")
,15 => array("Dresden", "Saxony")
,16 => array("Bochum", "North Rhine-Westphalia")
,17 => array("Wuppertal", "North Rhine-Westphalia")
,18 => array("Bielefeld", "North Rhine-Westphalia")
,19 => array("Bonn", "North Rhine-Westphalia")
,20 => array("Mannheim", "Baden-Württemberg")
,21 => array("Karlsruhe", "Baden-Württemberg")
,22 => array("Gelsenkirchen", "North Rhine-Westphalia")
,23 => array("Wiesbaden", "Hesse")
,24 => array("Münster", "North Rhine-Westphalia")
,25 => array("Mönchengladbach", "North Rhine-Westphalia")
,26 => array("Chemnitz", "Saxony")
,27 => array("Augsburg", "Bavaria")
,28 => array("Braunschweig", "Lower Saxony")
,29 => array("Aachen", "North Rhine-Westphalia")
,30 => array("Krefeld", "North Rhine-Westphalia")
,31 => array("Halle", "Saxony-Anhalt")
,32 => array("Kiel", "Schleswig-Holstein")
,33 => array("Magdeburg", "Saxony-Anhalt")
,34 => array("Oberhausen", "North Rhine-Westphalia")
,35 => array("Lübeck", "Schleswig-Holstein")
,36 => array("Freiburg", "Baden-Württemberg")
,37 => array("Hagen", "North Rhine-Westphalia")
,38 => array("Erfurt", "Thuringia")
,39 => array("Kassel", "Hesse")
,40 => array("Rostock", "Mecklenburg-Vorpommern")
,41 => array("Mainz", "Rhineland-Palatinate")
,42 => array("Hamm", "North Rhine-Westphalia")
,43 => array("Saarbrücken", "Saarland")
,44 => array("Herne", "North Rhine-Westphalia")
,45 => array("Mülheim an der Ruhr", "North Rhine-Westphalia")
,46 => array("Solingen", "North Rhine-Westphalia")
,47 => array("Osnabrück", "Lower Saxony")
,48 => array("Ludwigshafen am Rhein", "Rhineland-Palatinate")
,49 => array("Leverkusen", "North Rhine-Westphalia")
,50 => array("Oldenburg", "Lower Saxony")
);
then merged them
$all_cities_arr = array_merge($usa_cities_arr, $spain_cities_arr, $germany_cities_arr);
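One detail worth knowing about the merge: array_merge() renumbers integer keys, so the 1-based keys of the separate lists do not survive. A small sketch with two shortened, illustrative arrays ($first and $second are made-up names, not the full lists above):

```php
<?php
// Two short arrays keyed from 1, in the same shape as the city lists.
$first  = array(1 => array("Madrid", "Madrid"), 2 => array("Barcelona", "Catalonia"));
$second = array(1 => array("Berlin", "Berlin"), 2 => array("Hamburg", "Hamburg"));

$merged = array_merge($first, $second);

// Integer keys are renumbered from 0, so nothing is overwritten
// and the merged array is keyed 0, 1, 2, 3.
print_r(array_keys($merged));
```

That is harmless here because the code only loops over the values, but it matters if the keys were meant to be rankings.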
One problem was that passing the words to iconv() raised errors: a string that is mostly ASCII was treated as ASCII even when it contained a single UTF-8 character.
Using the plain string functions on the words also caused problems, because some UTF-8 characters are 2 bytes long and a byte-based function can split one character in two.
Realizing this made me think of the mb_ (multibyte) functions.
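To make the byte versus character difference concrete, here is a minimal sketch. It assumes PHP 7 or later (for the \u{...} escape) and the mbstring extension; the variable names are only for illustration.

```php
<?php
// "Málaga": \u{E1} is á, which UTF-8 stores as two bytes.
$city = "M\u{E1}laga";

$byte_count = strlen($city);             // byte-based: 7
$char_count = mb_strlen($city, "UTF-8"); // character-based: 6

// substr() cuts after two bytes, so it keeps "M" plus half of á.
$byte_slice = substr($city, 0, 2);
// mb_substr() cuts after two characters, so it keeps "Má".
$char_slice = mb_substr($city, 0, 2, "UTF-8");

var_dump(mb_check_encoding($byte_slice, "UTF-8")); // bool(false): broken sequence
var_dump(mb_check_encoding($char_slice, "UTF-8")); // bool(true)
```

The byte-based slice is no longer valid UTF-8, which is the kind of damage described above.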
After much trial and error I came up with
// Returns true when $needle loosely matches $haystack or a needle-sized slice of it.
function check_levenshtein_similar($haystack, $needle) {
    $needle_length = strlen($needle);
    $haystack_length = strlen($haystack);
    if ($needle_length >= $haystack_length) {
        // The needle is at least as long as the haystack: compare the whole strings.
        $levenshtein_distance = levenshtein($needle, $haystack);
        similar_text($needle, $haystack, $similar_text_percent);
        if ( ($levenshtein_distance <= 2) || ( floor($similar_text_percent) >= 50) ) {
            return true;
        }
    } else {
        // Slide a needle-sized window along the haystack, one byte at a time.
        $offset = 0;
        while ($needle_length <= $haystack_length) {
            $haystack_substr = substr($haystack, $offset, $needle_length);
            $levenshtein_distance = levenshtein($needle, $haystack_substr);
            similar_text($needle, $haystack_substr, $similar_text_percent);
            if ( ($levenshtein_distance <= 2) || ( floor($similar_text_percent) >= 50) ) {
                return true;
            } else {
                $offset += 1;
                $haystack_length -= 1;
            }
        }
    }
    return false;
}
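// Splits $string into pieces using the "ASCII" encoding, i.e. one byte per piece.
function mbStringToArray ($string) {
    $array = array(); // initialise so an empty string returns an empty array
    $strlen = mb_strlen($string, "ASCII");
    while ($strlen) {
        $array[] = mb_substr($string,0,1,"ASCII");
        $string = mb_substr($string,1,$strlen,"ASCII");
        $strlen = mb_strlen($string, "ASCII");
    }
    return $array;
}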
// Echoes every city/state pair whose city or state loosely matches $needle.
function find_city($haystack, $needle) {
    foreach ($haystack as $city_state_arr) {
        $split_city_str = mbStringToArray($city_state_arr[0]);
        $split_state_str = mbStringToArray($city_state_arr[1]);
        $converted_city_str = "";
        $converted_state_str = "";
        // Rebuild the city name, passing non-ASCII pieces through mb_convert_encoding().
        foreach ($split_city_str as $city_letter) {
            if ( !mb_check_encoding($city_letter, "ASCII") ) {
                $city_letter = mb_convert_encoding($city_letter, "ASCII", "UTF-8");
                $converted_city_str .= $city_letter;
            } else {
                $converted_city_str .= $city_letter;
            }
        }
        // Same treatment for the state name.
        foreach ($split_state_str as $state_letter) {
            if ( !mb_check_encoding($state_letter, "ASCII") ) {
                $state_letter = mb_convert_encoding($state_letter, "ASCII", "UTF-8");
                $converted_state_str .= $state_letter;
            } else {
                $converted_state_str .= $state_letter;
            }
        }
        // Re-encode the original strings for output.
        $city_state_arr[0] = mb_convert_encoding($city_state_arr[0], "UTF-8", "ASCII");
        $city_state_arr[1] = mb_convert_encoding($city_state_arr[1], "UTF-8", "ASCII");
        if ( ( stripos($converted_city_str, $needle) !== false )
          || ( stripos($converted_state_str, $needle) !== false )
          || ( check_levenshtein_similar($converted_city_str, $needle) )
          || ( check_levenshtein_similar($converted_state_str, $needle) ) ) {
            echo $city_state_arr[0] . " " . $city_state_arr[1] . "<br />";
        }
    }
}
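For comparison with mbStringToArray(), this is a sketch of one way to split a UTF-8 string into whole characters instead of bytes. It is an alternative, not the code used above: utf8_chars() is a hypothetical helper name, and it assumes the input is meant to be UTF-8 and that PHP 7 or later is available for the \u{...} escape.

```php
<?php
// Hypothetical helper: split a UTF-8 string into an array of whole characters.
function utf8_chars($string) {
    // The /u modifier makes PCRE treat the subject as UTF-8, so the empty
    // pattern matches between characters rather than between bytes.
    $chars = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);
    // preg_split() returns false when the subject is not valid UTF-8.
    return $chars === false ? array() : $chars;
}

$chars = utf8_chars("M\u{FC}nster"); // "Münster"
echo count($chars), "\n";            // 7 characters, although the string is 8 bytes
```

On PHP 7.4 and later, mb_str_split() does the same job.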
$needle is an unsafe string from a form, so it would need to be processed before being used in the wild.
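A rough sketch of what that processing might look like follows. clean_needle() and its 50-character limit are made up for illustration, and what counts as safe depends on where the value ends up; the checks here only cover type, encoding, and length, plus HTML escaping for anything echoed back to the page.

```php
<?php
// Hypothetical helper: basic checks on a search term taken from a form.
// Returns the cleaned string, or null when the input is unusable.
function clean_needle($raw, $max_chars = 50) {
    if (!is_string($raw)) {
        return null; // e.g. an array was submitted instead of a string
    }
    $needle = trim($raw);
    if ($needle === "" || !mb_check_encoding($needle, "UTF-8")) {
        return null; // empty, or not valid UTF-8
    }
    // Cap the length, counting characters rather than bytes.
    return mb_substr($needle, 0, $max_chars, "UTF-8");
}

$needle = clean_needle("  <b>Bonn</b> ");
// Escape the value whenever it is written into HTML output.
$safe_for_html = htmlspecialchars($needle, ENT_QUOTES, "UTF-8");
echo $safe_for_html, "\n"; // &lt;b&gt;Bonn&lt;/b&gt;
```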
And it feels hackish. Some things could be better named, the code could be more elegant, and the levenshtein() and similar_text() thresholds need tweaking to reduce false positives when $needle is a short string.
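The short-needle problem comes from the fixed limit: the Levenshtein distance between two strings of one or two characters can never exceed 2, so a limit of 2 accepts every window. One possible tweak, sketched below, is to scale the allowed distance with the needle length. The cut-off values and both function names are guesses for illustration, not tested settings, and the comparison is byte-based, so it assumes the strings have already been reduced to ASCII.

```php
<?php
// Hypothetical helper: how many edits to tolerate for a needle of this length.
function max_distance_for($needle) {
    $length = strlen($needle);
    if ($length <= 3) {
        return 0; // very short needles must match exactly
    }
    if ($length <= 6) {
        return 1;
    }
    return 2;
}

// Hypothetical helper: case-insensitive comparison with a length-aware limit.
function is_close_match($needle, $candidate) {
    $distance = levenshtein(strtolower($needle), strtolower($candidate));
    return $distance <= max_distance_for($needle);
}

var_dump(levenshtein("la", "Be") <= 2);       // bool(true): the fixed limit accepts it
var_dump(is_close_match("la", "Be"));         // bool(false)
var_dump(is_close_match("Berlim", "Berlin")); // bool(true): one edit, six letters
```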
I can't help thinking that mapping 2-byte UTF-8 characters to ASCII replacements would be easier, but I wanted to tackle it without doing that, for the learning experience.
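For reference, the mapping approach could look something like the sketch below. to_ascii() is a hypothetical name and the map is deliberately tiny; it only covers a few of the characters in the lists above (plus ß) and would need extending for real data. It assumes PHP 7 or later for the \u{...} escape.

```php
<?php
// Hypothetical helper: replace selected UTF-8 characters with ASCII stand-ins.
function to_ascii($string) {
    $map = array(
        "\u{E1}" => "a",  // á
        "\u{F6}" => "o",  // ö
        "\u{FC}" => "u",  // ü
        "\u{DF}" => "ss", // ß
    );
    // strtr() with an array replaces every key it finds and leaves
    // everything else, including plain ASCII, untouched.
    return strtr($string, $map);
}

echo to_ascii("M\u{E1}laga"), "\n";     // Malaga
echo to_ascii("D\u{FC}sseldorf"), "\n"; // Dusseldorf
```

The converted strings are plain ASCII, so byte-based functions such as stripos() and levenshtein() can then be used on them directly.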