Character Encoding Issue

bar338 · January 27, 2010, 11:19pm

I’m having a little trouble getting character encodings to work for one of my applications. I have a large directory of names, many of which are foreign. My goal is to get as many of these names to render as possible. I’ve decided to use a utf-8 encoding to get this to work.

I’ve had to do a lot of work and testing to figure out everywhere that I need to change this encoding but I think i’m close. There seems to be just 1-2 places left that could be causing the problem.

An example of the problem. The word Zürich renders as ZÃ¼rich.

In order to get the names into the database I imported them using a csv file. I use a mysql database.

These are the places i’ve made changes/checked to ensure that the character encoding is UTF-8.
Original .csv file is UTF-8
Database is UTF-8
Database table encodings are UTF-8
Webpage Doctype is UTF-8.
Putty SSH Client - UTF-8 encoding.

I’ve viewed the data directly in the mysql database and it renders correctly as Zürich. However, when it is displayed on the website it displays as ZÃ¼rich. This leads me to believe that there is something in the process between the database and the webpage that is messign up the data (because both the database and webpage our UTF-8 encodings). I’ve also tried statically displaying Zürich on the webpage and it displays fine so it must be something else.
Do I need to control the encoding in the SQL query on the database? Can you think of anything else that might be causing this problem to occur?

For reference: here is the doctype of the webpage displaying the data.

<!DOCTYPE html
PUBLIC “-//W3C//DTD XHTML 1.0 Transitional//EN”
“http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd”>

<html xmlns=“http://www.w3.org/1999/xhtml” xml:lang=“en” lang=“en”>
<head>
<title>title</title>
<meta http-equiv=“Content-Type” content=“text/html; charset=UTF-8” />

I really appreciate your help. This has become quite a frustrating mess.

Edit: I’ve narrowed it down more.

As previously mentioned, the character encoding appears to be correct in the database when viewing through putty.
I created a simple sql command to grab the data from the database. Then ran these two lines of code.
$result = $db->query($sql);
$row = $result->fetch_assoc();
print_r($row);
print “Zürich”;

The display I get is:
ZÃ¼rich
Zürich

So the problem is isolated to the database query. Suggestions on what I need to do to fix this?

system · January 27, 2010, 11:42pm

query “SET NAMES encoding” must be executed every time database is used.
right after database connect is a good place
i.e
mysql_connect(host,user,pass);
mysql_select_db(db);
mysql_query(“SET NAMES utf8”);
or other functions if you’re using some fancy libraries

Paul_Wilkins · January 27, 2010, 11:51pm

For further reference: [Connection Character Sets and Collations](Connection Character Sets and Collations)

bar338 · January 27, 2010, 11:57pm

I tried that and it didn’t make a difference. This is what i’m sitting with as far as my query goes right now. I’ve got a couple different attempts to fix the encoding issue built in. The actual connection is done in the constructor of the class. I’m also printing out the encoding as you can see and it prints ‘utf8’. However, the encoding is still messed up and the data does not display properly.

public function query($query)
   {
        mysql_set_charset('utf8',$this->link);
        mysql_query("SET character_set_results = 'utf8', character_set_client = 'utf8', character_set_connection = 'utf8', character_set_database = 'utf8', character_set_server = 'utf8'",$this->link);
     $result = mysql_query($query,$this->link);

      $charset = mysql_client_encoding($this->link);
      print $charset;

      if(!$this->link)
      {
         throw new Exception('Database query error');
      }

      return new DbRowSet($result);
   }

Paul_Wilkins · January 28, 2010, 12:15am

Could you confirm for us what you get from the server when you run the following command?


SHOW VARIABLES LIKE 'c&#37;';

system · January 28, 2010, 12:16am

Well, I can state 2 things.

Everything is set right.
You’re good with debugging. So, continue with it. check everything again.
make a simple code that adds a string to the database and displays it back.
Here is the simple checklist I posted before.

bar338 · January 29, 2010, 8:24pm

I see this:

±-------------------------±---------------------------+
| Variable_name | Value |
±-------------------------±---------------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | utf8 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | utf8 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
| collation_connection | utf8_general_ci |
| collation_database | utf8_general_ci |
| collation_server | utf8_general_ci |
| completion_type | 0 |
| concurrent_insert | 1 |
| connect_timeout | 10 |
±-------------------------±---------------------------+

One interesting thing I noticed was:

When I visit character_sets_dir | /usr/share/mysql/charsets/

It lists these charsets:

armscii8.xml cp1250.xml cp1256.xml cp850.xml cp866.xml geostd8.xml hebrew.xml Index.xml koi8r.xml latin1.xml latin5.xml macce.xml README
ascii.xml cp1251.xml cp1257.xml cp852.xml dec8.xml greek.xml hp8.xml keybcs2.xml koi8u.xml latin2.xml latin7.xml macroman.xml swe7.xml

No where in this list do i see a xml file for utf8. Is this a problem?

Paul_Wilkins · January 30, 2010, 12:18am

The README in there should say that those are only for single-byte libraries. UTF-8 should be compiled in the strings library instead.

system · January 30, 2010, 5:15am

All these variables are meaningless.
You can live with any values of server settings.
Runtime parameters only matter.

crmalibu · January 30, 2010, 6:02am

Send an http header to declare the charset/encoding.


header('Content-type: text/html;charset=utf-8');

One way to represent ü in unicode is code point 252. In utf-8 encoding, you need byte value 195 followed by 188.

Now, if I guess your declaration is currently something like iso-8859-1, and look at the character table, 195 maps to Ã, and 188 maps to ¼, and that’s what you’re getting. So the string from the db is utf-8 encoded.

Why does the print “Zürich”; work? Because it uses a single byte value of 252 to represent that character, which would be how you do it in iso-8859-*

You can compare with strlen() to see the extra byte.

kyberfabrikken · January 30, 2010, 11:10am

What crmalibu said. Just to elaborate; The http meta-tag is a lie. The browser will completely ignore it, unless there aren’t any http headers with the response. When your page is served through a web server, that is usually the case. So in, other words, it does nothing. Calling header from php does work though.

bar338 · January 30, 2010, 3:39pm

crmalibu:

Send an http header to declare the charset/encoding.
header('Content-type: text/html;charset=utf-8');
One way to represent ü in unicode is code point 252. In utf-8 encoding, you need byte value 195 followed by 188.

Now, if I guess your declaration is currently something like iso-8859-1, and look at the character table, 195 maps to Ã, and 188 maps to ¼, and that’s what you’re getting. So the string from the db is utf-8 encoded.

Why does the print “Zürich”; work? Because it uses a single byte value of 252 to represent that character, which would be how you do it in iso-8859-*

You can compare with strlen() to see the extra byte.

I need a little more clarification to understand this. What exactly do you mean by declaration when you said? “if I guess your declaration is currently something like iso-8859-1” and how do I check this?
Where exactly do I need to use the header line? Does it go in php next to the query? Or does it need to go in the template next to the display of the item?

Thanks for your help.

system · January 30, 2010, 3:57pm

how do I check this?

Use your browser’s menu, View - Encoding and see what browser thinks of page encoding.
You can see header itself by using one of HTTP sniffer tools, like LiveHTTPHeaders FireFox addon, for example

Where exactly do I need to use the header line?

Before any output to the browser.
It belongs not to SQL nor HTML, but HTTP. It is HTTP header, and must be sent to the browser before any output.

bar338 · January 30, 2010, 4:08pm

Using HTTPFox i started the http monitoring and loaded the page with the encoding problem. I then clicked on the heading to have it display all the details. I see the following two references to the charset.

Request Header: Accept-Charset ISO-8859-1,utf-8;q=0.7,*;q=0.7
Response Header: Content-Type text/html;charset=utf-8

This is after adding the header line to my code (before anything is rendered). I’m sure it brings us a step closer to a solution, but the problem still exists. :injured:

system · January 30, 2010, 4:16pm

Your import part can spoil the data.
I wonder why don’t you get yourself a test code yet.

crmalibu · January 30, 2010, 11:21pm

post the output of


echo base64_encode($row['foo']);

SpacePhoenix · January 30, 2010, 11:43pm

Have you tried:

mysql_set_charset('utf8',$link);

From the PHP manual page on mysql_set_charset()

Note: This is the preferred way to change the charset. Using mysql_query() to execute SET NAMES … is not recommended.

bar338 · January 31, 2010, 3:14am

Ya that was one of the first thing I tried but it didn’t help. It does appear to set the encoding correctly though which confuses me. I set this before the query, run the query and then print the encoding type usingmysql_client_encoding($this->link). Despite this, the text continues to show the encoding error.

post the output of

echo base64_encode($row['foo']);

Output of base64_encode:
SW5zdGl0dXQgZi4gRW5lcmdpZXRlY2huaWsKTUwgSiAzNgpTb25uZWdnc3RyYXNzZSAzCjgwOTIgWsODwrxyaWNoLCBTd2l0emVybGFuZA==

Out of my own curiosity, what exactly is the purpose of this information?

SpacePhoenix · January 31, 2010, 3:56am

At what point are you using mysql_set_charset(‘utf8’,$link); as it needs to be used the very next line after you select the database which your using.

crmalibu · January 31, 2010, 4:03am

Because now I can use base64_decode() and perfectly replicate that binary string, and be confident about what it is. Without it, there’s a possibility that the data might get changed after it passes through so many software layers(browsers, this forum).

Your data looks like it was double encoded. You should have byte 2 bytes, values 195,188 to represent that character. You have 4 bytes, 195,131,194,188

You need to test to find out where it’s going wrong. To help, use this to properly produce the character in php


$utf8_u_with_diaresis = chr(195) . chr(188);

Topic		Replies	Views
Carachters converted to <?>! Databases	7	672	March 26, 2010
PHP Encoding UTF Nightmare PHP	2	788	August 31, 2015
MySql encoding problem - can't change to utf-8 Databases	5	32779	October 8, 2014
Pdo+mysql and utf-8 encoding problem; PHP	8	11892	December 23, 2010
Character Set Help PHP	12	814	March 13, 2011

Character Encoding Issue

Related topics