  1. #1
    SitePoint Addict
    Join Date
    Jan 2001
    Location
    Ottawa ON
    Posts
    315
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    What am I doing wrong with character encoding on this site?

    This is a web page that I'm creating: http://harpersblackbook.ca/main.php

    I want to be able to use curly quotes and similar characters on the site without boxes or question marks appearing, but currently that's not working.

    I have encoded the content in UTF-8 in my database. The page is being sent to the client as UTF-8 (as specified in the <head> of the page through a meta tag and in the HTTP header via PHP).

    What am I doing wrong and what do I need to do to properly display these extended characters?

    Thank you. I read a couple of articles like this quite good one to get a feel for the basics: http://www.joelonsoftware.com/articles/Unicode.html

  2. #2
    SitePoint Author silver trophybronze trophy

    Join Date
    Nov 2004
    Location
    Ankh-Morpork
    Posts
    12,158
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    The page declares the encoding to be UTF-8, but the left double quotation mark (U+201C) isn't actually encoded as UTF-8. In UTF-8 it should be the three octets E2 80 9C, but the page sends a single octet (0x93), which is the Windows-1252 encoding of the left double quote (“).
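
    A minimal sketch of that difference, written in Python rather than PHP purely for illustration (the variable names are made up for this example):

```python
# U+201C LEFT DOUBLE QUOTATION MARK in the two encodings discussed above.
left_quote = "\u201c"

utf8_bytes = left_quote.encode("utf-8")      # three octets
cp1252_bytes = left_quote.encode("cp1252")   # one octet

print(utf8_bytes.hex(" "))    # e2 80 9c
print(cp1252_bytes.hex(" "))  # 93

# A lone 0x93 octet is not valid UTF-8, so a strict decoder rejects it.
try:
    cp1252_bytes.decode("utf-8")
except UnicodeDecodeError as err:
    print("not valid UTF-8:", err.reason)
```

    A browser that trusts the UTF-8 declaration hits the same problem as the strict decoder here, which is presumably where the boxes and question marks come from.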

    So either you've got incorrect data in your database, or there's some sort of conversion before the page is sent.
    Birnam wood is come to Dunsinane

  3. #3
    Resident curmudgeon bronze trophy gary.turner's Avatar
    Join Date
    Jan 2009
    Location
    Dallas
    Posts
    990
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    More than likely (certitude?) the data was saved as Western/Windows (that is, Windows-1252), MSFT's proprietary charset/encoding. I think that's more or less the default in MSFT and many MSFT-oriented editors.

    cheers,

    gary
    Anyone can build a usable website. It takes a graphic
    designer to make it slow, confusing, and painful to use.

    Simple minded html & css demos and tutorials

  4. #4
    SitePoint Addict
    Join Date
    Jan 2001
    Location
    Ottawa ON
    Posts
    315
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Yes, I think that's what has happened.

    I suppose my first question is how do I convert the text in my database from Microsoft's proprietary character encoding to Unicode? Is there a function to do that in phpMyAdmin?

    I am also curious about how you troubleshot that problem - how can I determine the way that the character is encoded in future?

    I suppose the last matter that I need to learn about going forward is with forms that submit content into my database, how I ensure that the data is going into the database into the right format; in the past I've dealt with this issue with the PHP function to convert those characters to their character entities. But that's clearly a less than ideal response to the problem - and it's one that reflects my lack of knowledge about how to do it right.

    Thank you.

  5. #5
    SitePoint Addict
    Join Date
    Jan 2001
    Location
    Ottawa ON
    Posts
    315
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Well, I read this page - http://www.tuxsudo.com/?p=4 - and switched the character encoding to iso-8859-1 and things are looking good. So, I suppose I'll just use that character encoding in future and text copied-and-pasted from MS Word with its "smart quotes" won't result in question marks or boxes appearing.

  6. #6
    SitePoint Addict
    Join Date
    Jan 2001
    Location
    Ottawa ON
    Posts
    315
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Never mind - that solution isn't working now that I'm trying to create an RSS feed. I'm getting an 'invalid character' error from my XML for that "smart quote": http://harpersblackbook.ca/feed.php

    What's the solution in terms of crafting an rss feed?

    I'm setting the content type through PHP

    I wonder what else I need. Surely not a CDATA block.

    I tried running the feed through a validator, but that wasn't helpful because it says that the feed is valid and doesn't return a relevant error/suggestion: http://feedvalidator.org/check.cgi?u....ca%2Ffeed.php
    Last edited by prowsej; Jun 8, 2009 at 19:50.

  7. #7
    SitePoint Author silver trophybronze trophy

    Join Date
    Nov 2004
    Location
    Ankh-Morpork
    Posts
    12,158
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by prowsej View Post
    I am also curious about how you troubleshot that problem - how can I determine the way that the character is encoded in future?
    In this case I just ran the page through the HTML validator. It says the offending non-UTF-8 character is '\x93', if you read the error message closely.

    Another way would be to retrieve the page via cURL and look at it in a decent text editor. I use vim. Pressing 'g8' shows the octet(s) used to encode the character under the cursor. If you set the editor's encoding to UTF-8, it shows this offending character as '<93>'.
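
    If you'd rather script the check, here is a rough Python sketch of the same idea. The function name find_invalid_utf8 and the sample bytes are invented for this example; in practice you would feed it the raw bytes of the downloaded page.

```python
def find_invalid_utf8(data: bytes) -> list:
    """Return (offset, hex) for each octet a strict UTF-8 decoder rejects."""
    problems = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # everything from pos onward is clean
        except UnicodeDecodeError as err:
            bad = pos + err.start
            problems.append((bad, f"{data[bad]:02x}"))
            pos = bad + 1  # skip the offending octet and keep scanning
    return problems


# Hypothetical sample: Windows-1252 quotes followed by real UTF-8 quotes.
sample = b"He said \x93hi\x94 \xe2\x80\x9cok\xe2\x80\x9d"
print(find_invalid_utf8(sample))  # [(8, '93'), (11, '94')]
```

    Only the two Windows-1252 octets are reported; the properly encoded quotes at the end pass untouched.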

    Quote Originally Posted by prowsej View Post
    I suppose the last matter that I need to learn about going forward is with forms that submit content into my database, how I ensure that the data is going into the database into the right format; in the past I've dealt with this issue with the PHP function to convert those characters to their character entities. But that's clearly a less than ideal response to the problem - and it's one that reflects my lack of knowledge about how to do it right.
    The first thing you should do is to use an accept-charset="utf-8" attribute in your <form> tag. That will tell user agents that your application only accepts UTF-8 encoded data. It's no guarantee, though, so you should still validate it server-side before inserting it into your database.
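
    As a sketch of what that server-side validation could look like (Python for illustration only; require_utf8 and its parameters are hypothetical names, not part of any framework):

```python
def require_utf8(raw: bytes, field: str) -> str:
    """Decode a submitted form value, rejecting anything that is not UTF-8."""
    try:
        return raw.decode("utf-8")  # decoding is strict by default
    except UnicodeDecodeError as err:
        # Refuse the input instead of storing bytes in an unknown encoding.
        raise ValueError(f"{field}: invalid UTF-8 at byte {err.start}") from err


# A correctly encoded value passes through unchanged.
print(require_utf8("\u201cok\u201d".encode("utf-8"), "comment"))
```

    Rejecting bad input at the door is one option; converting it is another, but that only works if you know what encoding it was really in.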

    I've written a SitePoint article about character encoding that you may find useful as a primer if you're unfamiliar with the ins and outs of encodings.

    Quote Originally Posted by prowsej View Post
    Well, I read this page - http://www.tuxsudo.com/?p=4 - and switched the character encoding to iso-8859-1 and things are looking good.
    That's because browsers are kind enough to let you get away with using Windows-1252 and declaring ISO 8859-1. It's still wrong, though, and the HTML validator will complain. Read strictly as ISO 8859-1, the octet 0x93 means U+0093, which is an invalid character in an HTML document. It's in the range reserved for C1 control characters in ISO 8859-1.
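
    To make that concrete, here is a small Python illustration (the variable names are arbitrary) of the same octet read under each encoding:

```python
import unicodedata

raw = b"\x93"

as_latin1 = raw.decode("iso-8859-1")  # what the declaration literally says
as_cp1252 = raw.decode("cp1252")      # what the author actually meant

# Under ISO 8859-1 the octet is a C1 control character (category "Cc").
print(f"U+{ord(as_latin1):04X}", unicodedata.category(as_latin1))  # U+0093 Cc

# Under Windows-1252 it is the curly quote.
print(f"U+{ord(as_cp1252):04X}", unicodedata.name(as_cp1252))  # U+201C LEFT DOUBLE QUOTATION MARK
```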

    Quote Originally Posted by prowsej View Post
    What's the solution in terms of crafting an rss feed?
    The same as anywhere else: the encoding you use in your page must match the encoding you declare.
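
    For a feed, one way to keep the two in step is to let an XML library write both the declaration and the bytes. A rough Python sketch follows; build_feed is a made-up helper, and a real RSS channel also needs link and description elements, which are left out here for brevity.

```python
import xml.etree.ElementTree as ET


def build_feed(title: str, item_titles: list) -> bytes:
    """Serialize a bare-bones RSS document as UTF-8 bytes with a matching declaration."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    for item_title in item_titles:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = item_title
    # The declared encoding and the actual bytes come from the same call,
    # so they cannot disagree. (xml_declaration needs Python 3.8 or later.)
    return ET.tostring(rss, encoding="utf-8", xml_declaration=True)


feed_bytes = build_feed("Example feed", ["\u201cSmart quotes\u201d survive"])
print(feed_bytes.decode("utf-8"))
```

    The HTTP Content-Type header should name the same charset, otherwise the mismatch just moves up a layer.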

    Quote Originally Posted by prowsej View Post
    I wonder what else I need. Surely not a CDATA block.
    That won't make a difference. It only lets you avoid escaping '<' and '&' characters, nothing more.
    Birnam wood is come to Dunsinane

  8. #8
    om nom nom nom Stomme poes's Avatar
    Join Date
    Aug 2007
    Location
    Netherlands
    Posts
    10,269
    Mentioned
    50 Post(s)
    Tagged
    2 Thread(s)
    So does the OP need to manually regex through all the documents, find and replace?? To take MS-generated stuff and turn it into UTF-8?

    I wish I could regex my colleague : (

    in the past I've dealt with this issue with the PHP function to convert those characters to their character entities. But that's clearly a less than ideal response to the problem - and it's one that reflects my lack of knowledge about how to do it right.
    As I understand it, the PHP function that does that is guessing at the right character. This would be why it doesn't always get it right. It's likely most valuable when you've created the content in the first place, know which charset you saved the data in, and then want to change to another charset. It has a better chance of guessing right in those cases.

  9. #9
    SitePoint Author silver trophybronze trophy

    Join Date
    Nov 2004
    Location
    Ankh-Morpork
    Posts
    12,158
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by Stomme poes View Post
    So does the OP need to manually regex through all the documents, find and replace?? To take MS-generated stuff and turn it into UTF-8?
    I think it will be difficult to write such a regex, and even more difficult to write the replacement.
    Birnam wood is come to Dunsinane

  10. #10
    om nom nom nom Stomme poes's Avatar
    Join Date
    Aug 2007
    Location
    Netherlands
    Posts
    10,269
    Mentioned
    50 Post(s)
    Tagged
    2 Thread(s)
    With so many customers out there sending things to companies in Windows-1252, I would be incredibly surprised to find that there isn't some tool out there that will let people (fairly easily) change improper characters to correct ones.
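
    For what it's worth, most languages' standard libraries can do the conversion itself without any regex. Here is a hedged Python sketch; to_utf8 is a hypothetical helper, and it assumes each value is either entirely valid UTF-8 or entirely Windows-1252. A value that mixes the two would be mangled, so try it on a copy of the data first.

```python
def to_utf8(data: bytes) -> bytes:
    """Return data as UTF-8, treating anything that isn't valid UTF-8 as Windows-1252."""
    try:
        data.decode("utf-8")
        return data  # already valid UTF-8; leave it untouched
    except UnicodeDecodeError:
        # Windows-1252 leaves five octets undefined (81, 8D, 8F, 90, 9D);
        # errors="replace" turns those into U+FFFD instead of raising.
        return data.decode("cp1252", errors="replace").encode("utf-8")


print(to_utf8(b"\x93hi\x94").hex(" "))  # e2 80 9c 68 69 e2 80 9d
```

    Leaving valid UTF-8 alone matters: running the conversion twice on the same text would otherwise double-encode it.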

  11. #11
    SitePoint Enthusiast Homie_187's Avatar
    Join Date
    Oct 2008
    Location
    United States
    Posts
    33
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    FYI - I had a similar problem a few months ago and got some good advice here:

    http://www.sitepoint.com/forums/show....php?p=4221647

    One of the user contributed functions in the PHP manual seemed to convert very accurately:

    http://php.net/manual/en/function.utf8-encode.php#45226

  12. #12
    SitePoint Addict
    Join Date
    Jan 2001
    Location
    Ottawa ON
    Posts
    315
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Thank you for your suggestions, everyone.

    I have gotten my RSS feed (and site) working by specifying the character encoding as windows-1252.

    And I feel like I've learned a bit more about character encoding in the process.

