  1. #1
    SitePoint Enthusiast Homie_187's Avatar
    Join Date
    Oct 2008
    Location
    United States
    Posts
    33

    Encoding Issue with Curly Quotation Marks

    Hi,

    This is probably a simple encoding issue but I don't know much about encoding so I am asking for some help:

    I have a MySQL database with a lot of text containing curly, left and right double quotation marks.

    I can see the quotation marks when I look at the data in phpMyAdmin, and if I copy and paste the text into my code editor, I find that the hex codes are 1C20 and 1D20.

    I am writing a PHP script that selects text from the database and serves it as UTF-8. When I view the page, I see only question marks where the quotation marks should be, and after copying and pasting into the hex editor I see that the quotation marks (both left and right) have somehow become hex FFFD, which is apparently not even a real character(?)

    So, my question is: what do I have to do in the PHP script to output the quotation marks as recognizable characters?

    Thanks.

  2. #2
    SitePoint Wizard
    Join Date
    Mar 2008
    Posts
    1,149
    You can't copy and paste like that to get the actual bytes of a character. The bytes on the page and the bytes in whatever you paste the text into will not be the same.

    Try changing the encoding of the page until you find the encoding that works.
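
    The same trial-and-error can be done on the raw bytes instead of in the browser. The following is a minimal sketch in Python, not anything posted in the thread: `try_decodings` and the sample bytes are hypothetical names and data chosen for illustration, and the sample assumes the stored text is Windows-1252.

```python
from typing import Dict, List, Optional


def try_decodings(raw: bytes, candidates: List[str]) -> Dict[str, Optional[str]]:
    """Decode raw with each candidate encoding; None marks a strict failure."""
    results: Dict[str, Optional[str]] = {}
    for name in candidates:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


# Hypothetical sample: "Hi" wrapped in curly quotes, stored as Windows-1252.
sample = b"\x93Hi\x94"
results = try_decodings(sample, ["utf-8", "cp1252", "latin-1"])
for name, text in results.items():
    print(f"{name:8} -> {text!r}")
```

    Note that ISO-8859-1 (`latin-1`) never fails to decode, because every byte maps to some code point, so "it decoded" is not proof on its own. The output still has to be checked by eye.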

  3. #3
    SitePoint Wizard silver trophy kyberfabrikken's Avatar
    Join Date
    Jun 2004
    Location
    Copenhagen, Denmark
    Posts
    6,157
    If you're serving the data as UTF-8 and it looks wrong, then the data isn't UTF-8. Note that even if your database connection and tables are set to UTF-8, that doesn't prevent you from storing some other encoding in them. The really damning thing about charset issues is that if you already have content in your database in an unknown encoding, it becomes a lot more complex to fix. Is this a live application, or are you still in initial development mode?
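
    One plausible reading of the symptoms in the first post, sketched in Python under the assumption that the stored bytes are Windows-1252: the curly quotes are single bytes that are not valid UTF-8, so a UTF-8 decoder substitutes U+FFFD (the replacement character). The 1C20 and 1D20 seen in the editor would match the UTF-16 little-endian bytes of the same two characters. `hex_bytes` is a hypothetical helper name.

```python
LEFT, RIGHT = "\u201c", "\u201d"  # left and right curly double quotes


def hex_bytes(text: str, encoding: str) -> str:
    """Return the hex form of text encoded with the given encoding."""
    return text.encode(encoding).hex()


# The same two characters are different bytes in each encoding.
for encoding in ("utf-8", "cp1252", "utf-16-le"):
    print(f"{encoding:10} {hex_bytes(LEFT, encoding)} {hex_bytes(RIGHT, encoding)}")

# Assumption: the database holds Windows-1252 bytes, but the page says UTF-8.
stored = (LEFT + "Hi" + RIGHT).encode("cp1252")
shown = stored.decode("utf-8", errors="replace")
print(repr(shown))  # each invalid byte becomes U+FFFD
```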

  4. #4
    SitePoint Enthusiast Homie_187's Avatar
    Join Date
    Oct 2008
    Location
    United States
    Posts
    33
    Thanks for the info.

    The site is not live yet, so there's no emergency. If I needed a quick fix, it would be pretty easy for me to just do a search/replace on an SQL dump and change the curly quotation marks to ASCII quotation marks.

    I am just interested in knowing if there is a way I can fix it without changing the actual content. In the future I want to be able to support users copying and pasting in text from MS Word, which means that the site is going to have to support curly quotation marks at some point, so I might as well try now.

    I read Tommy's guide to encoding and he recommends using UTF-8 so I figured I would give it a shot.

    I figure there must be a way to serve the content that I have as UTF-8, because phpMyAdmin serves it with a UTF-8 header and everything looks fine. I don't know if it is converting those characters or if it's doing something else that I need to be doing in my script. Maybe I could just try digging through the code to see if I can figure out what it does.

  5. #5
    SitePoint Wizard
    Join Date
    Mar 2008
    Posts
    1,149
    First, try changing the encoding of the page until you find the one that works. Once you know the encoding, we can give you pointers on what to do.

    If your pages use UTF-8, everything sent by a proper browser will be in UTF-8.

  6. #6
    SitePoint Enthusiast Homie_187's Avatar
    Join Date
    Oct 2008
    Location
    United States
    Posts
    33
    Okay, I tried Windows-1252 and it works.

    So... is there a way that PHP can convert the content from Windows-1252 to UTF-8?

    Thanks.

  7. #7
    SitePoint Enthusiast Aken's Avatar
    Join Date
    Oct 2007
    Location
    Racine, Wisconsin, USA
    Posts
    99
    I think it'd make more sense to make the code itself web-friendly regardless of what encoding is used.

    You should tell your script to encode these special characters before it enters the database. htmlentities() is a great function for such a thing.
    Eric Roberts - Racine, WI Web Design & Development
    www.cryode.com

  8. #8
    SitePoint Wizard
    Join Date
    Mar 2008
    Posts
    1,149
    Precisely what you need:
    http://php.net/manual/en/function.utf8-encode.php#45226
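
    The conversion itself amounts to "decode as Windows-1252, re-encode as UTF-8". Below is a rough Python sketch of that idea, not the code from the linked comment. `cp1252_to_utf8` is a hypothetical name, and the "leave it alone if it is already valid UTF-8" guard is an assumption added here so that running it twice does not double-encode. In PHP, `iconv()` or `mb_convert_encoding()` with Windows-1252 as the source encoding should play the analogous role. Plain `utf8_encode()` treats its input as ISO-8859-1, which is probably why the linked comment exists.

```python
def cp1252_to_utf8(raw: bytes) -> bytes:
    """Return raw as UTF-8 bytes, assuming non-UTF-8 input is Windows-1252."""
    try:
        raw.decode("utf-8")  # already valid UTF-8 (this includes plain ASCII)
        return raw
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is Windows-1252.
        # A few bytes (e.g. 0x81) are undefined there and will raise.
        return raw.decode("cp1252").encode("utf-8")


legacy = b"\x93Hi\x94"  # hypothetical row content in Windows-1252
fixed = cp1252_to_utf8(legacy)
print(fixed.decode("utf-8"))
```

    The guard is a heuristic: a Windows-1252 byte sequence can be valid UTF-8 by coincidence, so a one-off conversion of data whose encoding is known is safer than guessing per row.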

    About Aken's suggestion: You really shouldn't convert non-ASCII characters to HTML entities. What if you had to access your members database through a Jabber chat server? HTML entities stored in the database make no sense at that point.

  9. #9
    SitePoint Wizard silver trophy kyberfabrikken's Avatar
    Join Date
    Jun 2004
    Location
    Copenhagen, Denmark
    Posts
    6,157
    cp1252 and ISO-8859-1 look a lot alike, so they are often mixed up. In fact, some browsers will send data in cp1252 while (falsely) claiming it's ISO-8859-1. That might be what went wrong for you. Another problem with ISO-8859-1 is that if the user tries to send characters that don't exist in the encoding, browsers will encode those characters as HTML entities. This is a complete mess, since you have no way of determining whether the user meant to submit the literal HTML entity or the character it translates to.

    There is no such ambiguity with UTF-8, so if you just use that throughout, things are a lot easier. As you are still in the development phase, I would suggest that you wipe all your data and check that everything speaks UTF-8 before proceeding. Fixing ambiguous data can be a headache.
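
    Both points can be illustrated with a short Python sketch. The variable names are made up for the example, and Python's `xmlcharrefreplace` error handler is used here only as a stand-in for the browser behaviour described above, not as a claim about how any particular browser implements it.

```python
import html

# 1. The two encodings differ in the 0x80-0x9F range.
raw = b"\x93"
as_cp1252 = raw.decode("cp1252")   # left curly double quote, U+201C
as_latin1 = raw.decode("latin-1")  # an invisible C1 control code, U+0093

# 2. A character missing from ISO-8859-1 gets submitted as an entity ...
from_paste = "\u201cHi\u201d".encode("latin-1", errors="xmlcharrefreplace")
# ... which is byte-for-byte what a user typing the entity by hand sends.
from_typing = "&#8220;Hi&#8221;".encode("latin-1")

print(from_paste == from_typing)
print(html.unescape(from_paste.decode("latin-1")))
```

    Once the two submissions are identical on the wire, nothing on the server can tell them apart.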

    Originally posted by Aken:
    You should tell your script to encode these special characters before it enters the database. htmlentities() is a great function for such a thing.

    I wouldn't recommend that, since it makes it impossible to use string-related functions in the database layer. For example, you won't be able to sort correctly. You would also need to manually decode the data before using it in a non-HTML context. Just use a charset that supports the code points you need to display, and you're fine.
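
    A small Python sketch of why entity-encoded storage gets in the way of string functions. `to_entities` is a hypothetical helper that mimics the effect of entity-encoding non-ASCII characters before storing them. It is not PHP's `htmlentities()`, which emits named entities where they exist.

```python
def to_entities(text: str) -> str:
    """Replace every non-ASCII character with a numeric character reference."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


quote = "\u201cHi\u201d"
stored = to_entities(quote)
# Length checks now count markup instead of characters.
print(len(quote), len(stored))

# Sorting on the stored form compares '&', '#', and digits, not letters.
words = ["zoo", "\u00e9t\u00e9", "apple"]
by_stored_form = sorted(words, key=to_entities)
print(by_stored_form)
```

    With the entity form as the sort key, the accented word lands ahead of "apple" purely because "&" sorts before letters. No database collation can correct that, since the database never sees the actual letter.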

