Character Encoding Issue

I’m having a little trouble getting character encodings to work for one of my applications. I have a large directory of names, many of which are foreign. My goal is to get as many of these names to render as possible. I’ve decided to use a utf-8 encoding to get this to work.

I’ve had to do a lot of work and testing to figure out everywhere that I need to change this encoding but I think i’m close. There seems to be just 1-2 places left that could be causing the problem.

An example of the problem. The word Zürich renders as Zürich.

In order to get the names into the database I imported them using a csv file. I use a mysql database.

These are the places i’ve made changes/checked to ensure that the character encoding is UTF-8.
Original .csv file is UTF-8
Database is UTF-8
Database table encodings are UTF-8
Webpage Doctype is UTF-8.
Putty SSH Client - UTF-8 encoding.

I’ve viewed the data directly in the mysql database and it renders correctly as Zürich. However, when it is displayed on the website it displays as Zürich. This leads me to believe that there is something in the process between the database and the webpage that is messign up the data (because both the database and webpage our UTF-8 encodings). I’ve also tried statically displaying Zürich on the webpage and it displays fine so it must be something else.
Do I need to control the encoding in the SQL query on the database? Can you think of anything else that might be causing this problem to occur?

For reference: here is the doctype of the webpage displaying the data.

<!DOCTYPE html
PUBLIC “-//W3C//DTD XHTML 1.0 Transitional//EN””>

<html xmlns=“” xml:lang=“en” lang=“en”>
<meta http-equiv=“Content-Type” content=“text/html; charset=UTF-8” />

I really appreciate your help. This has become quite a frustrating mess.

Edit: I’ve narrowed it down more.

As previously mentioned, the character encoding appears to be correct in the database when viewing through putty.
I created a simple sql command to grab the data from the database. Then ran these two lines of code.
$result = $db->query($sql);
$row = $result->fetch_assoc();
print “Zürich”;

The display I get is:

So the problem is isolated to the database query. Suggestions on what I need to do to fix this?

query “SET NAMES encoding” must be executed every time database is used.
right after database connect is a good place
mysql_query(“SET NAMES utf8”);
or other functions if you’re using some fancy libraries

For further reference: [Connection Character Sets and Collations](Connection Character Sets and Collations)

I tried that and it didn’t make a difference. This is what i’m sitting with as far as my query goes right now. I’ve got a couple different attempts to fix the encoding issue built in. The actual connection is done in the constructor of the class. I’m also printing out the encoding as you can see and it prints ‘utf8’. However, the encoding is still messed up and the data does not display properly.

public function query($query)
        mysql_query("SET character_set_results = 'utf8', character_set_client = 'utf8', character_set_connection = 'utf8', character_set_database = 'utf8', character_set_server = 'utf8'",$this->link);
     $result = mysql_query($query,$this->link);

      $charset = mysql_client_encoding($this->link);
      print $charset;

         throw new Exception('Database query error');

      return new DbRowSet($result);

Could you confirm for us what you get from the server when you run the following command?


Well, I can state 2 things.

  1. Everything is set right.
  2. You’re good with debugging. So, continue with it. check everything again.
    make a simple code that adds a string to the database and displays it back.
    Here is the simple checklist I posted before.

I see this:

| Variable_name | Value |
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | utf8 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | utf8 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
| collation_connection | utf8_general_ci |
| collation_database | utf8_general_ci |
| collation_server | utf8_general_ci |
| completion_type | 0 |
| concurrent_insert | 1 |
| connect_timeout | 10 |

One interesting thing I noticed was:

When I visit character_sets_dir | /usr/share/mysql/charsets/

It lists these charsets:

armscii8.xml cp1250.xml cp1256.xml cp850.xml cp866.xml geostd8.xml hebrew.xml Index.xml koi8r.xml latin1.xml latin5.xml macce.xml README
ascii.xml cp1251.xml cp1257.xml cp852.xml dec8.xml greek.xml hp8.xml keybcs2.xml koi8u.xml latin2.xml latin7.xml macroman.xml swe7.xml

No where in this list do i see a xml file for utf8. Is this a problem?

The README in there should say that those are only for single-byte libraries. UTF-8 should be compiled in the strings library instead.

All these variables are meaningless.
You can live with any values of server settings.
Runtime parameters only matter.

Send an http header to declare the charset/encoding.

header('Content-type: text/html;charset=utf-8');

One way to represent ü in unicode is code point 252. In utf-8 encoding, you need byte value 195 followed by 188.

Now, if I guess your declaration is currently something like iso-8859-1, and look at the character table, 195 maps to Ã, and 188 maps to ¼, and that’s what you’re getting. So the string from the db is utf-8 encoded.

Why does the print “Zürich”; work? Because it uses a single byte value of 252 to represent that character, which would be how you do it in iso-8859-*

You can compare with strlen() to see the extra byte.

What crmalibu said. Just to elaborate; The http meta-tag is a lie. The browser will completely ignore it, unless there aren’t any http headers with the response. When your page is served through a web server, that is usually the case. So in, other words, it does nothing. Calling header from php does work though.

I need a little more clarification to understand this. What exactly do you mean by declaration when you said? “if I guess your declaration is currently something like iso-8859-1” and how do I check this?
Where exactly do I need to use the header line? Does it go in php next to the query? Or does it need to go in the template next to the display of the item?

Thanks for your help.

how do I check this?

Use your browser’s menu, View - Encoding and see what browser thinks of page encoding.
You can see header itself by using one of HTTP sniffer tools, like LiveHTTPHeaders FireFox addon, for example

Where exactly do I need to use the header line?

Before any output to the browser.
It belongs not to SQL nor HTML, but HTTP. It is HTTP header, and must be sent to the browser before any output.

Using HTTPFox i started the http monitoring and loaded the page with the encoding problem. I then clicked on the heading to have it display all the details. I see the following two references to the charset.

Request Header: Accept-Charset ISO-8859-1,utf-8;q=0.7,*;q=0.7
Response Header: Content-Type text/html;charset=utf-8

This is after adding the header line to my code (before anything is rendered). I’m sure it brings us a step closer to a solution, but the problem still exists. :injured:

Your import part can spoil the data.
I wonder why don’t you get yourself a test code yet.

post the output of

echo base64_encode($row['foo']);

Have you tried:


From the PHP manual page on mysql_set_charset()

Note: This is the preferred way to change the charset. Using mysql_query() to execute SET NAMES … is not recommended.

Ya that was one of the first thing I tried but it didn’t help. It does appear to set the encoding correctly though which confuses me. I set this before the query, run the query and then print the encoding type usingmysql_client_encoding($this->link). Despite this, the text continues to show the encoding error.

post the output of

echo base64_encode($row['foo']); 

Output of base64_encode:

Out of my own curiosity, what exactly is the purpose of this information?

At what point are you using mysql_set_charset(‘utf8’,$link); as it needs to be used the very next line after you select the database which your using.

Because now I can use base64_decode() and perfectly replicate that binary string, and be confident about what it is. Without it, there’s a possibility that the data might get changed after it passes through so many software layers(browsers, this forum).

Your data looks like it was double encoded. You should have byte 2 bytes, values 195,188 to represent that character. You have 4 bytes, 195,131,194,188

You need to test to find out where it’s going wrong. To help, use this to properly produce the character in php

$utf8_u_with_diaresis = chr(195) . chr(188);