Can't Match Some (Foreign) Films After Table Convert to UTF8 (General or Unicode)

I updated our film table to UTF-8 Charset with utf8_unicode_ci (I changed to utf8_general_ci to see if any difference) and can’t figure it out.

It seems that any film with a foreign character in its title will not appear in MATCH queries. They used to under the older character set, but no longer.

So, let’s say I want to search for Y Tu Mamá También, I should be able to search “y tu mama” or “mamá también” and see some results. But not getting any results now.

Another one is México Bárbaro. I used to be able to get that to appear with just “mexico” as my search term, but now I can’t even with “mexico barbaro”.

All feedback appreciated!

Cheers!
Ryan

Try this:

Thanks for the feedback and I must apologize. I actually discovered that my normal search page dealt with the word fine, while my AJAX search (using json_encode) was the one having the issue and leaving the film off while building the json.

Here is my new question. This table/column is UTF-8 Charset and utf8_general_ci collation (at the moment).

The way to fix the issue was to run utf8_encode() before the json_encode(). But why would I need to utf8_encode data coming from a utf8 charset column?

Cheers!
Ryan

The routine I use does not use JSON:

script type="text/javascript">
  function showResult(str)
  {
    if (str.length==0)
    {
      document.getElementById("livesearch").innerHTML="";
      // document.getElementById("livesearch").style.border="0px";
      return;
    }
    if (window.XMLHttpRequest)
    {
      // code for IE7+, Firefox, Chrome, Opera, Safari
      xmlhttp=new XMLHttpRequest();
    }else{  // code for IE6, IE5
      xmlhttp=new ActiveXObject("Microsoft.XMLHTTP");
    }
    xmlhttp.onreadystatechange=function() {
      if (xmlhttp.readyState==4 && xmlhttp.status==200)
      {
        document.getElementById("livesearch").innerHTML=xmlhttp.responseText;
        // document.getElementById("livesearch").style.border="1px solid #A5ACB2";
      }
    }
    // xmlhttp.open("GET","db-results.php?q="+str,true);
    xmlhttp.open("POST","db-results.php?q="+str,true);
    xmlhttp.send();
  }
</script>

But why would I need to utf8_encode data coming from a utf8 charset column?

I have had a great deal of problems with charsets and try to just use UTF-8

Yeah, I need to figure out what’s going on.

For some reason one of my columns that is UTF-8, if I try to utf8_encode the column it creates strange characters, showing that it is already UTF-8. But if I try to utf8_encode another column, it works as it should, suggesting that the data is not UTF-8 though in a UTF-8 column.

Need to figure this out.

I’m using the typeahead search plugin.

Cheers
Ryan

I’ve been busy with UTF-8 lately myself.
I’m still not sure what’s happening, and believe me, I’ve looked.

But I’ve found this to work better than iconv;

$city_name = mb_convert_encoding($city_state_vals[0], "UTF-8", "ASCII");
1 Like

Just tried it on a site and it works a treat, many thanks. Those elusive ? have been very annoying to say the least.

I’d still like to know where the change is happening.
I have my database CHARACTER SET and COLLATION both UTF-8
I have my text editor UTF-8 no BOM
My PHP is UTF-8
My HTML is UTF-8
My HTTP headers are UTF-8

For the life of me I can’t think of anything else to set to UTF-8

My best guess is that because PHP has “soft typing” somewhere along the way strings lose the UTF-8 and become ASCII

I have one table with charset UTF8 with utf8_unicode_ci collation. And another table with the same setup.

If I were to do a mysql query from the two tables and combine two columns, one from each table, one would need to be utf8_encode() and the other would not (the other would form the weird characters when you re-encode utf8). I’m not getting it.

I have two functions that are requiring me to run utf8_encode on names with foreign characters, though the character set on the column/table is already utf8 (and utf8_unicode_ci collation).

I’ve instead had more success running this: $title = iconv(‘UTF-8’, ‘ISO-8859-1//TRANSLIT//IGNORE’, $title), to hand the data coming out of a utf-8 column.

Not sure what is wrong with my table, or if I’m missing setting another variable.

All feedback appreciated.
Ryan

Are you sure you want it converted to ASCII ? iconv() is
http://php.net/manual/en/function.iconv.php

string iconv ( string $in_charset , string $out_charset , string $str )

“Y Tu Mamá También” is not ASCII and will display something like “Y Tu Mam? Tambi?n”

Latest version using a list of cities from Yahoo:

http://www.johns-jokes.com/downloads/sp-e/jb-ajax-search/cities/

Shows original city name with funny foreign characters and cleaned city name.

I’m in the same boat. Everything is UTF-8, but it is still creating the weird character encoding errors. I thought the whole point of having everything UTF-8 is so I wouldn’t ever have to use utf8_encode.

Either my table is not really UTF8, though it says it is, or something else is up.

Super frustrated about this, since new additions are having issues with characters like the double-dash, and the angled single quotation mark.

I’m going nuts!

Let you know if I make progress.

Cheers!
Ryan

Alright, so here’s what went down.

I’m using a form on an admin page. The page is charset UTF-8. In the text box I add:

The modern mammal metropolis of Zootopia is a city like no other. Comprised of habitat neighborhoods like ritzy Sahara Square and frigid Tundratown, it’s a melting pot where animals from every environment live together—a place where no matter what you are, from the biggest elephant to the smallest shrew, you can be anything. But when optimistic Officer Judy Hopps (Ginnifer Goodwin) arrives, she discovers that being the first bunny on a police force of big, tough animals isn’t so easy.

I then submit this to a php script to add to table that is UTF-8 Charset, uft8_unicode_ci col.

And it becomes…

The modern mammal metropolis of Zootopia is a city like no other. Comprised of habitat neighborhoods like ritzy Sahara Square and frigid Tundratown, it’s a melting pot where animals from every environment live together—a place where no matter what you are, from the biggest elephant to the smallest shrew, you can be anything. But when optimistic Officer Judy Hopps (Ginnifer Goodwin) arrives, she discovers that being the first bunny on a police force of big, tough animals isn’t so easy.

What the heck? Why are the characters going funky?

Cheers!
Ryan

I would like to know that myself.

Try this with your problem characters added to the string

<?php
$source_string = "Forlì Málaga Münster Sør-Trøndelag Østfold Västra Götaland Skåne";
echo $source_string . "<br/>";
$converted_string = mb_convert_encoding($source_string, "UTF-8", "ASCII");
echo $converted_string;
?>

My knowledge on charsets and multibytes characters is sketchy and searches have indicated that once the problem arises Pandora’s Box suddenly springs open:(

It looks like your page is set to display only UTF-8 characters and you are pasting a string of characters containing hidden (from UTF-8) multibyte characters. Only UTF-8 characters are shown in the (UTF-8 character) text box.

The multibyte information is saved to the database.

After retrieval the hidden multibyte characters require the PHP conv(…) filter before displaying on the UTF-8 web-page.

Numerous complication arise in different formats when trying to search for the stored multibyte strings. Read more about searching problems

Problems in searching arise which I tried to eliminate using these techniques:

Version 001 - Demo - using PHP soundex(…)

Version 002 - Demo - using PHP metaphone(…)

Version 003 - Demo - using AJAX single character search for syllables

… unfortunately all versions have their limitations. To try and eliminate all eventualities data storage and/or display compromises must be made.

I figured it out!

So, everything was UTF-8, except the MySQL connection, which was still thinking iso was being sent. I added this (with identifier):

mysql_set_charset('utf8', $conn);

I place that right after my mysql_connect script and it solved everything.

So much so that I had to remove tons of utf8_encode functions from around the site, since the double-encode was breaking characters.

Now characters seem to pass without issue (both ways) with no strange results.

Cheers
Ryan

1 Like

Did you resolve your search problems or just eliminate ? from rendering to your screen?

Yep!

Typing in “y tu” got it to show, or “mama” or “tambien”. Everything is working again.

Whoop!

1 Like

I’ve finally tracked down the cause of my troubles.

Even though I had my database CHARACTER SET and COLLATION defaulted to UTF-8 I added

$mysqli->set_charset("utf8");

Can’t hurt, right?

Alas, still an odd bug.

I double checked the source files character encoding.
Some were UTF-8 no BOM some were not.
Huh?
I double checked,
Yes, my Notepad++ was set to UTF-8 no BOM

Then it dawned on me.

The files I hadn’t renamed were UTF-8 no BOM sure enough.
BUT the files that I had renamed were not.

I know Windows gives a Warning dialog box when file extensions are changed, and rightly so.
But it gives no warning when only the names are changed.
However, when it copies the file (what renaming really is) it doesn’t bring the character encoding along with it.

I fixed up the source files, TRUNCATEd my tables, INSERTed again, and I no longer need mb_convert_encoding() to be able to get characters like
Münster Forlì Østfold Tromsø Skåne Östergötland Västerås

1 Like

New Year, new operating system? Time to switch to Linux :slight_smile:

1 Like