Character Set Help

Right - I ‘get’ character sets and understand all the theory, but that doesn’t appear to stop me running into problems with it all. I have a value in the database which contains a Unicode character (I think - I copied and pasted a name from Facebook into phpMyAdmin!), and phpMyAdmin displays it fine. If I write a simple script to echo it from the database, it appears to work fine (Firefox uses ISO-8859-1 encoding to show the page), but if I switch Firefox to UTF-8 and use utf8_encode, it shows some other random characters instead.

Ultimately, I’m trying to get this value into an XML document (which can be seen here - http://flying.nugc.net/export/lists/4.xml), but at the moment the xml validation fails because of it.

What should I be looking at / doing to get the whole thing to work in Unicode? Everything I try seems to do the opposite of what I expect!

Take a browse over the user comments here. :slight_smile:

PHP: mysql_set_charset - Manual

You shouldn’t have to utf8_encode anything.

I thought not, especially as it was just making it worse! I did look at that function, but I got confused by the fact that it appeared to be working if I switched Firefox to ISO-8859-1 mode, so I thought it couldn’t be the database.

Cheers though, will take a look!

It’s good you provided the XML file, otherwise it would be impossible to guess what the problem is. It looks like you have the encodings really screwed up :). The data are not in ISO-8859-1 but in Windows-1250! You’ve probably been misled by Firefox displaying it properly in ISO-8859-1, but change FF to Windows-1250 and they will appear fine, too. The reason is the two “unicode” characters, as you call them, Ž and Š, which are not part of ISO-8859-1 at all, so you cannot represent them in that encoding. However, they are covered by Windows-1250, where their values are 8E and 8A respectively. These byte values have no printable character assigned in ISO-8859-1, so FF, trying to be a nice browser, will switch to Windows-1250 just for these two characters, hoping it will guess them right.

So in effect you need to use mb_convert_encoding or its iconv counterpart to convert it to utf-8.
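As a rough sketch of that conversion (assuming the stored bytes really are Windows-1250, and using iconv because not every mbstring build knows that code page), the snippet below converts two sample bytes and also shows why utf8_encode made things worse: it assumes the input is ISO-8859-1. The sample string is made up for illustration.

```php
<?php
// Hypothetical sample: the bytes 0x8E and 0x8A are "Ž" and "Š" in Windows-1250.
$raw = "\x8E and \x8A";

// Declaring the real source encoding gives proper UTF-8.
$fromCp1250 = iconv('Windows-1250', 'UTF-8', $raw);

// Treating the same bytes as ISO-8859-1 (which is what utf8_encode() assumes)
// turns them into C1 control characters instead of letters.
$fromLatin1 = iconv('ISO-8859-1', 'UTF-8', $raw);

// Print the resulting byte sequences as hex so they can be compared.
echo bin2hex($fromCp1250), "\n";
echo bin2hex($fromLatin1), "\n";
```

The two hex dumps differ, which is the "random characters" effect: the second string is valid UTF-8, just for the wrong characters.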

Now, you probably want to sort this all out to have everything in utf-8. You need to start with the database first and be sure you have inserted the data correctly. You need to pay attention to these things:

  1. First, make the text columns in your table in utf8. utf8_general_ci collation is a good general choice.

  2. Second, you need to connect to the db in utf8 as well - this includes setting up phpMyAdmin to make a utf8 connection with the db and to display its pages in your browser in utf8. Set “MySQL connection collation” properly on the main page. There used to be a choice of charset in the language select box, but it looks like the newer versions of PMA are all set to utf8.

  3. Now you can safely insert data into db with all weird characters.

  4. AnthonySterling’s suggestion is what you should be using in your php script. Now all will be in utf8 and no conversions will be necessary.

I’m really puzzled how you ended up with Windows-1250! My guess is you might have something screwed up in PMA or your table structure and the data weren’t stored properly in the first place. Or, in your PHP script the connection collation could be a default Windows-1250 - possible if the db was set up that way. MySQL’s character settings can be overwhelming at the beginning, but they provide very fine-grained control over character sets. I remember spending hours solving character set problems when MySQL 4.1 came out with all these new features! :smiley:

That post is a great help, I will give all this a go when I get home! Many Thanks!

The Windows-1250 encoding probably came from Facebook - I copied and pasted the name from Facebook straight into PMA, including the foreign characters. The conversion could even have been done in the clipboard.

No, it doesn’t matter where you copied the name from nor what encoding the Facebook page uses; as long as it appears correct after pasting it into the browser’s input box, it is good (I think all operating systems nowadays use some form of Unicode for the clipboard and other internal storage). What matters is the encoding the browser will use to transfer the name to the server, which is either the page encoding or the encoding defined in the accept-charset attribute of the form tag. The db connection settings are also important in this case: the character set of queries sent to the db, the character set of the connection with the db, and the character set of results sent from the db to PHP (see Connection Character Sets and Collations). It’s easiest to use SET NAMES 'utf8' or mysql_set_charset('utf8') so that all three are set to utf8, and then no problems arise.
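A minimal sketch of such a connection, using mysqli rather than the older mysql_* functions. The helper name connect_utf8 and its parameters are made up for illustration; set_charset() and character_set_name() are the real mysqli methods.

```php
<?php
// Hypothetical helper: host, credentials and database name are placeholders.
function connect_utf8(string $host, string $user, string $pass, string $dbName): mysqli
{
    $db = new mysqli($host, $user, $pass, $dbName);
    if ($db->connect_error) {
        throw new RuntimeException('Connect failed: ' . $db->connect_error);
    }

    // Sets the client, connection and result character sets in one call,
    // much like SET NAMES, and also tells the client library about it.
    if (!$db->set_charset('utf8')) {
        throw new RuntimeException('Could not set charset: ' . $db->error);
    }

    return $db;
}
```

After something like `$db = connect_utf8('localhost', 'user', 'secret', 'mydb');` (placeholder values), `$db->character_set_name()` should report the charset actually in use, which is a quick way to confirm the setting took effect.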

OK, all makes sense. So what’s the difference between utf8_unicode_ci and utf8_general_ci?

There’s a subtle difference. Both of them can store the same characters; you will only see differences when searching or sorting strings that contain certain language-specific letters - see this. For searches I most often use utf8_unicode_ci. If the column is not defined as utf8_unicode_ci (it can be utf8_general_ci or any other utf8_ collation), you can still apply it in the query like this:

SELECT * FROM tbl WHERE col LIKE '%search string%' COLLATE utf8_unicode_ci;

I suppose utf8_unicode_ci is better for wider language support.

I thought this was the case, but I checked anyway, and my database tables, columns and PMA connection are all utf8_unicode_ci. My editor saves in utf8 and the meta tag is set to utf8. I’m unsure about the Apache headers, but the meta tag should overrule that, even if it is a bit sloppier / slower to process.

I guess it must be the connection from my PHP script then, will look at that one next!

By meta tag, I mean the XML header in this case, of course :stuck_out_tongue:

Used mysqli->set_charset(), works perfectly now :slight_smile:

Thanks all for your help!

Glad you have sorted it out!

You can check the Apache headers with the Live HTTP Headers add-on for Firefox, and if they are not what you want them to be, you can change them in an .htaccess file:


AddType 'text/xml; charset=UTF-8' xml

That is, assuming it is a real file. If the XML file is generated on the fly by a PHP script, you can let PHP send the header instead:


header('Content-Type: text/xml; charset=utf-8');
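For the generated case, a small sketch of how the header and the XML declaration can be kept in agreement. The function build_member_xml, the element names and the sample name are all invented for illustration, and the input is assumed to be UTF-8 already (for example, fetched over a utf8 connection).

```php
<?php
// Hypothetical helper: builds a tiny XML document from UTF-8 strings.
function build_member_xml(array $names): string
{
    // The closing "?>" is split so it can never be mistaken for the PHP end tag.
    $xml = '<?xml version="1.0" encoding="UTF-8"?' . ">\n<members>\n";
    foreach ($names as $name) {
        // Escape &, <, >, and quotes for XML; the input is treated as UTF-8.
        $safe = htmlspecialchars($name, ENT_XML1 | ENT_QUOTES, 'UTF-8');
        $xml .= '  <member>' . $safe . "</member>\n";
    }
    return $xml . "</members>\n";
}

// Declare the same charset in the HTTP header as in the XML declaration.
if (!headers_sent()) {
    header('Content-Type: text/xml; charset=utf-8');
}

// Sample data written with Unicode escapes so the script's own file
// encoding does not matter.
echo build_member_xml(["\u{017D}eljko \u{0160}imi\u{0107} & Co"]);
```

Note that htmlspecialchars returns an empty string when the input is not valid UTF-8 (unless ENT_SUBSTITUTE is passed), so an empty element in the output is a hint that the data is still in some other encoding.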

:slight_smile: