SitePoint Sponsor

User Tag List

Results 1 to 13 of 13
  1. #1
    SitePoint Wizard
    Join Date
    Oct 2005
    Posts
    1,834
    Mentioned
    5 Post(s)
    Tagged
    1 Thread(s)

    UTF-8 Character Problems With MySQL

    I have never had so much trouble with anything in my life as I have with this UTF-8 character set.

    I have a site and a phpBB forum. I run a query on the phpBB topics table to pull the most recently updated topics for display elsewhere on my site. I basically pull the topic title.

    The problem is that Russian characters (from spammers) show up just as question marks elsewhere on my site but display fine as Russian characters in the forum.

    I have this line in my HTML for both the forum and my website:

    <meta http-equiv="content-type" content="text/html; charset=UTF-8" />

    On the Home page for phpMyAdmin it displays this information about the character set:

    MySQL charset: UTF-8 Unicode (utf8)

    phpMyAdmin lists the collation on the phpBB topics table as:

    utf8_bin

    The collation on the topic_title field inside the phpBB topics table is:

    utf8_unicode_ci

    When I pull the topic_title to display on the rest of my site, I do a basic mysqli query, loop through the result set and then display the field. $row['topic_title'].

    So, what could possibly be going wrong to cause my query to display question marks when phpBB displays the Russian characters just fine?

    Do I need to put something in my PHP code to force the MySQL server to always return UTF-8? Or is it a problem in my web page where I have to manually set a UTF-8 header for each and every page view?

    Thanks for your help.

  2. #2
    SitePoint Addict
    Join Date
    Aug 2007
    Location
    GR
    Posts
    352
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Try to run this before any query:
    "SET NAMES UTF8";

  3. #3
    SitePoint Wizard
    Join Date
    Oct 2005
    Posts
    1,834
    Mentioned
    5 Post(s)
    Tagged
    1 Thread(s)
    Thanks for the advice, Steve. It seems a pain in the butt to have to run an additional query every time I do something. But if is the way to do it, then I'll do it.

    I ran show variables on MySQL and this is what it returned:

    character_set_client utf8
    character_set_connection utf8
    character_set_database latin1
    character_set_filesystem binary
    character_set_results utf8
    character_set_server latin1
    character_set_system utf8
    character_sets_dir /usr/share/mysql/charsets/
    collation_connection utf8_general_ci
    collation_database latin1_swedish_ci
    collation_server latin1_swedish_ci
    Could this be causing my problem? And could I just run a query to permanently change the character set (collation) to utf8_general_ci or something? Perhaps change the collation in phpMyAdmin?

    But would I risk screwing up anything else? If the database returns UTF-8, PHP will echo out UTF-8 without any additional handling, won't it?

  4. #4
    SitePoint Addict
    Join Date
    Aug 2007
    Location
    GR
    Posts
    352
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    This query is fast, I do it on each page load for every site that has non latin chars. Right now you should not
    be trying to change collation in the db as it might break something. Just try it ans see if it works.

  5. #5
    SitePoint Wizard
    Join Date
    Oct 2005
    Posts
    1,834
    Mentioned
    5 Post(s)
    Tagged
    1 Thread(s)
    Quote Originally Posted by _Steve_ View Post
    This query is fast, I do it on each page load for every site that has non latin chars. Right now you should not
    be trying to change collation in the db as it might break something. Just try it ans see if it works.
    OK, thanks.

  6. #6
    From space with love silver trophy
    SpacePhoenix's Avatar
    Join Date
    May 2007
    Location
    Poole, UK
    Posts
    5,032
    Mentioned
    103 Post(s)
    Tagged
    0 Thread(s)
    Are you using the MySQL or MySQLi interface? Both have a mysqli_set_charset function which should be used right after the connection to MySQL has been established, this sets the charset for the connection that has been established to use whatever charset you have set it to use.
    Community Team Advisor
    Forum Guidelines: Posting FAQ Signatures FAQ Self Promotion FAQ
    Help the Mods: What's Fluff? Report Fluff/Spam to a Moderator

  7. #7
    SitePoint Wizard
    Join Date
    Oct 2005
    Posts
    1,834
    Mentioned
    5 Post(s)
    Tagged
    1 Thread(s)
    Quote Originally Posted by SpacePhoenix View Post
    Are you using the MySQL or MySQLi interface? Both have a mysqli_set_charset function which should be used right after the connection to MySQL has been established, this sets the charset for the connection that has been established to use whatever charset you have set it to use.
    I'm using mysqli. I was wondering how phpBB does it. Russian characters come out right in phpBB. I formatting the output of the topic_title using htmlspecialchars(), and I'm using UTF-8 as the $charset parameter. So I don't think that would be causing me any problems.

    As a side note, using mysqli::set_charset() is the recommended way of setting the character set for communication with the MySQL server over a query using SET NAMES.

    This is the preferred way to change the charset. Using mysqli::query() to execute SET NAMES .. is not recommended.
    I guess going forward I'll incorporate a call to mysqli::set_charset() every time I make a connection to the DB just as a precaution to prevent any problems. Live and learn!

    Thanks for the suggestions, folks.

  8. #8
    Non-Member bronze trophy
    Join Date
    Nov 2009
    Location
    Keene, NH
    Posts
    3,760
    Mentioned
    23 Post(s)
    Tagged
    0 Thread(s)
    Stupid question -- you have the meta set, but how about your mime-type? Meta is mostly lip service and ignored by 99% of user agents. Is your server sending a iso-8859-1 mime-type by default?

    Are you using the php "header" command thus in your script?

    header('Content-Type: text/html; charset=UTF-8');

    If you aren't, the server is outputting whatever it's default is, and on 99.99% of servers that's iso-8859-1.

    If you have opera installed, go to your offending page, open up the "info" pane, and see what it says is being used for encoding and mime-type. Dimes to dollars this is your problem.

  9. #9
    SitePoint Wizard
    Join Date
    Oct 2005
    Posts
    1,834
    Mentioned
    5 Post(s)
    Tagged
    1 Thread(s)
    Quote Originally Posted by deathshadow60 View Post
    Stupid question -- you have the meta set, but how about your mime-type? Meta is mostly lip service and ignored by 99% of user agents. Is your server sending a iso-8859-1 mime-type by default?

    Are you using the php "header" command thus in your script?

    header('Content-Type: text/html; charset=UTF-8');

    If you aren't, the server is outputting whatever it's default is, and on 99.99% of servers that's iso-8859-1.

    If you have opera installed, go to your offending page, open up the "info" pane, and see what it says is being used for encoding and mime-type. Dimes to dollars this is your problem.
    Thanks for the advice. I did read elsewhere that it is a good idea to set the header to UTF-8 explicitly. But I also read somewhere that Apache will examine the HTML for the charset and then send it to the browser based on whatever is specified in the code.

    The HTML is generated by PHP. UTF-8 is specified in the HTML head as mentioned above and according to Firefox's Page Info (Tools -> Page Info) the page is encoded as UTF-8. It's reaching the browser as UTF-8.

    My hunch is that the database collation is what is screwing things up and the data connection is not transmitting in UTF-8.

    It is a little confusing. We have the charset in the HTML, the possibility of a browser header specifying the charset, then we have the MySQL server character set, the database collation, and finally the field collation. There's a lot of places where something get go amiss.

  10. #10
    Non-Member bronze trophy
    Join Date
    Nov 2009
    Location
    Keene, NH
    Posts
    3,760
    Mentioned
    23 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by cheesedude View Post
    Thanks for the advice. I did read elsewhere that it is a good idea to set the header to UTF-8 explicitly. But I also read somewhere that Apache will examine the HTML for the charset and then send it to the browser based on whatever is specified in the code.
    I'd be interested in hearing where that came from -- I've never ONCE seen apache by default send UTF-8 just because you output it in the meta. In fact i've never heard of Apache parsing the files it's sending unless it was for SHTML. That's REALLY none of it's business in a way... It's also why on all of my servers I put in this .htaccess when doing utf-8 templates for people:

    Code:
    <FilesMatch "\.(htm|html)$">
       AddDefaultCharset UTF-8
       DefaultLanguage en-US
    </FilesMatch>
    Hell, by the time Apache HAS the markup it's too late to send the connection header. Part of why in PHP you HAVE to call the header function before you output anything else.

    I also suggested Opera because it will show you the ACTUAL mime-type in the content header... the view->character encoding is pretty useless in FF... and with the web developer toolbar being broken on reporting header info on anything newer than FF 3.5... Though it's Information > response headers should tell you what's going on...

    You are correct though, there are SO MANY possible points of failure along the chain it's ridiculous.

  11. #11
    SitePoint Wizard frank1's Avatar
    Join Date
    Oct 2005
    Posts
    1,392
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Long back in my case it was encoding of the add file that was causing that problem...
    and
    The problem was there when i use to copy and paste sql but when i use to import sql file using utf-8 setting it use to work fine

  12. #12
    SitePoint Wizard
    Join Date
    Oct 2005
    Posts
    1,834
    Mentioned
    5 Post(s)
    Tagged
    1 Thread(s)
    Quote Originally Posted by deathshadow60 View Post
    I'd be interested in hearing where that came from -- I've never ONCE seen apache by default send UTF-8 just because you output it in the meta. In fact i've never heard of Apache parsing the files it's sending unless it was for SHTML. That's REALLY none of it's business in a way...
    There is not as much documentation on how to use Opera as I would like. I went into View -> Developer Tools -> Page Information and that said the encoding used by Opera was UTF-8 and that the MIME type was text/html.

    Code:
    Encoding (used by Opera): utf-8 (utf-8)
    Is that what you are looking for? Or should I be looking for something else and if so, where do I find it? I've looked all over and can't find anything else.

    Another site on the same server is served in ISO-8859-1 as specified in the HTML meta according to Opera's Page Information.

    If I'm not looking at the right thing, let me know.

    Quote Originally Posted by deathshadow60 View Post
    It's also why on all of my servers I put in this .htaccess when doing utf-8 templates for people:

    Code:
    <FilesMatch "\.(htm|html)$">
       AddDefaultCharset UTF-8
       DefaultLanguage en-US
    </FilesMatch>
    I expect that this:

    Code:
    <FilesMatch "\.(htm|html|php|css|js)$">
    will work to serve PHP, CSS, and JS as UTF-8, too?

    Quote Originally Posted by deathshadow60 View Post
    I also suggested Opera because it will show you the ACTUAL mime-type in the content header...
    Let me know where.

    I checked a couple HTTP Header checker websites and it looks like there is no content-type header being sent by the server I am on.

  13. #13
    Non-Member bronze trophy
    Join Date
    Nov 2009
    Location
    Keene, NH
    Posts
    3,760
    Mentioned
    23 Post(s)
    Tagged
    0 Thread(s)
    That was what I was suggesting looking at... do you have the web dev toolbar installed for FF? What does it report?

    NEVER heard of Apache much less PHP sending content-type other than iso-8859-1 as the default unless you edit it to do so. The header checkers saying none is effectively the same thing... if "information>Response headers" in the web dev toolbar for FF doesn't list content-type, you're sending none, so the browser is best guessing... Smarter browsers like Opera and Chrome may actually switch to UTF-8 because of the Meta -- FF it's probably a coin toss given it's still Netscape 4's sweetly retarded cousin (life is like a box of open source) .. but if there's no header IE will default to ISO-8859-1 on IE6/newer assuming not in quirks mode, and IE5/earlier will actually default to windows-1252.

    Do you have a link to the page in question?

    Oh, and yes, that change will cover all those files -- NOT that I'd bother with .CSS since by the specification it's only supposed to have ASCII7 in it in the first place even if you change the encoding.


Bookmarks

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •