-
Jun 7, 2009, 23:40 #1
- Join Date
- Jan 2001
- Location
- Ottawa ON
- Posts
- 315
What am I doing wrong with character encoding on this site?
This is a web page that I'm creating: http://harpersblackbook.ca/main.php
I want to be able to use curly quotes and whatnot on the site without boxes or question-marks appearing. But currently that's not working.
I have encoded the content in UTF-8 in my database. The page is being sent to the client as UTF-8 (as specified in the <head> of the page through a metatag and in the HTTP header via PHP).
What am I doing wrong and what do I need to do to properly display these extended characters?
Thank you. I have read a couple of articles to get a feel for the basics, including this quite good one: http://www.joelonsoftware.com/articles/Unicode.html
-
Jun 7, 2009, 23:57 #2
- Join Date
- Nov 2004
- Location
- Ankh-Morpork
- Posts
- 12,158
The page declares the encoding to be UTF-8, but the quotation mark isn't properly encoded as a UTF-8 quotation mark (U+201C). It should be encoded as the three octets E2 80 9C, but it's encoded as a single octet (hex 93), which is the Windows-1252 encoding for the left double quote (“) character.
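To make the byte-level difference concrete, here is a minimal sketch in Python (chosen only for illustration; the site in question uses PHP). It encodes U+201C both ways and checks whether the Windows-1252 octet can be read back as UTF-8.

```python
# U+201C LEFT DOUBLE QUOTATION MARK, the character discussed above
left_quote = "\u201c"

utf8_bytes = left_quote.encode("utf-8")
cp1252_bytes = left_quote.encode("cp1252")  # Python's name for Windows-1252

print(utf8_bytes.hex(" "))    # e2 80 9c
print(cp1252_bytes.hex(" "))  # 93

# A lone 0x93 octet is not a valid UTF-8 sequence, so a strict decode fails.
try:
    cp1252_bytes.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False

print(decodes_as_utf8)  # False
```

A browser told to expect UTF-8 hits the same failure and typically shows a box or question mark instead of the quote.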
So either you've got incorrect data in your database, or there's some sort of conversion before the page is sent.
Birnam wood is come to Dunsinane
-
Jun 8, 2009, 09:08 #3
- Join Date
- Jan 2009
- Location
- Dallas
- Posts
- 990
More than likely (certitude?) the data was saved as Western/Windows (Windows-1252), MSFT's proprietary charset/encoding. I think that's more or less the default in MSFT and many MSFT-oriented editors.
cheers,
gary
Anyone can build a usable website. It takes a graphic
designer to make it slow, confusing, and painful to use.
Simple minded html & css demos and tutorials
-
Jun 8, 2009, 11:48 #4
- Join Date
- Jan 2001
- Location
- Ottawa ON
- Posts
- 315
Yes, I think that's what has happened.
I suppose my first question is how do I convert the text in my database from Microsoft's proprietary character encoding to Unicode? Is there a function to do that in phpMyAdmin?
I am also curious about how you troubleshot that problem - how can I determine the way that the character is encoded in future?
The last matter I need to learn about going forward is forms that submit content into my database: how do I ensure that the data goes into the database in the right format? In the past I've dealt with this issue by using the PHP function that converts those characters to their character entities. But that's clearly a less than ideal response to the problem, and one that reflects my lack of knowledge about how to do it right.
Thank you.
-
Jun 8, 2009, 17:19 #5
- Join Date
- Jan 2001
- Location
- Ottawa ON
- Posts
- 315
Well, I read this page - http://www.tuxsudo.com/?p=4 - and switched the character encoding to iso-8859-1 and things are looking good. So, I suppose I'll just use that character encoding in future and text copied-and-pasted from MS Word with its "smart quotes" won't result in question marks or boxes appearing.
-
Jun 8, 2009, 18:34 #6
- Join Date
- Jan 2001
- Location
- Ottawa ON
- Posts
- 315
Never mind - that solution isn't working now that I'm trying to create an RSS feed. I'm getting an 'invalid character' error from my XML for that "smart quote": http://harpersblackbook.ca/feed.php
What's the solution in terms of crafting an rss feed?
I'm setting the content type through PHP
I wonder what else I need. Surely not a CDATA block.
I tried running the feed through a validator, but that wasn't helpful because it says that the feed is valid and doesn't return a relevant error/suggestion: http://feedvalidator.org/check.cgi?u....ca%2Ffeed.php
Last edited by prowsej; Jun 8, 2009 at 19:50.
-
Jun 8, 2009, 22:50 #7
- Join Date
- Nov 2004
- Location
- Ankh-Morpork
- Posts
- 12,158
In this case I just ran the page through the HTML validator. It says the offending non-UTF-8 character is '\x93', if you read the error message closely.
Another way would be to retrieve the page via cURL, and look at it in a decent text editor. I use vim. By pressing 'g8' it will show the octet(s) used to encode a particular character. If you set the editor's encoding to UTF-8, it would show this offending character as '<93>'.
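If neither the validator nor vim is at hand, the same check can be scripted. The following is a rough sketch in Python, not a tool mentioned in this thread: the helper name `find_invalid_utf8` and the sample bytes are made up for illustration. It reports the offset and value of each octet at which strict UTF-8 decoding fails.

```python
def find_invalid_utf8(data: bytes) -> list[tuple[int, int]]:
    """Return (offset, byte value) for each position where UTF-8 decoding fails."""
    problems = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the remainder decodes cleanly
        except UnicodeDecodeError as err:
            bad = pos + err.start  # err.start is relative to the slice
            problems.append((bad, data[bad]))
            pos = bad + 1  # skip the offending octet and keep scanning
    return problems


# Hypothetical page fragment containing Windows-1252 "smart quotes".
page = b"He said \x93hello\x94 to me."
for offset, value in find_invalid_utf8(page):
    print(f"offset {offset}: \\x{value:02x}")
# offset 8: \x93
# offset 14: \x94
```

Because it skips one octet at a time, a truncated multi-byte sequence may be reported more than once; for spotting stray Windows-1252 characters that is usually good enough.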
As for forms, the first thing you should do is to use an accept-charset="utf-8" attribute in your <form> tag. That will tell user agents that your application only accepts UTF-8 encoded data. It's no guarantee, though, so you should still validate it server-side before inserting it into your database.
I've written a SitePoint article about character encoding that you may find useful as a primer if you're unfamiliar with the ins and outs of encodings.
Declaring ISO 8859-1 appears to work because browsers are kind enough to let you get away with using Windows-1252 and declaring ISO 8859-1. It's still wrong, though, and the HTML validator will complain. U+0093 is an invalid character in an HTML document. It's in the range reserved for C1 control characters in ISO 8859-1.
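The difference is easy to see by decoding the same octet under both encodings. This is a small Python sketch for illustration only, using the standard `unicodedata` module to describe the resulting code points.

```python
import unicodedata

octet = b"\x93"

as_latin1 = octet.decode("iso-8859-1")  # strict ISO 8859-1 interpretation
as_cp1252 = octet.decode("cp1252")      # Windows-1252 interpretation

# Under ISO 8859-1 the octet maps to a C1 control character (category "Cc").
print(f"U+{ord(as_latin1):04X}", unicodedata.category(as_latin1))
# U+0093 Cc

# Under Windows-1252 it maps to the curly quote.
print(f"U+{ord(as_cp1252):04X}", unicodedata.name(as_cp1252))
# U+201C LEFT DOUBLE QUOTATION MARK
```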
For the RSS feed, the solution is the same as anywhere else: the encoding you use in your page must match the encoding you declare.
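An XML parser is much less forgiving than a browser here. As a hedged illustration, the Python sketch below feeds two tiny made-up documents (not the actual feed) to the standard library parser: one declares UTF-8 but contains Windows-1252 octets, the other declares UTF-8 and really is UTF-8. The helper name `parses` is invented for this example.

```python
import xml.etree.ElementTree as ET


def parses(document: bytes) -> bool:
    """Return True if the XML parser accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


prolog = b'<?xml version="1.0" encoding="UTF-8"?><title>'

# Declared UTF-8, but the quotes are single Windows-1252 octets.
mismatched = prolog + b"\x93Black Book\x94" + b"</title>"

# Declared UTF-8, and the quotes are encoded as UTF-8.
matched = prolog + "\u201cBlack Book\u201d".encode("utf-8") + b"</title>"

print(parses(mismatched))  # False
print(parses(matched))     # True
```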
A CDATA block won't make a difference. It only lets you avoid escaping '<' and '&' characters, nothing more.
Birnam wood is come to Dunsinane
-
Jun 9, 2009, 01:04 #8
- Join Date
- Aug 2007
- Location
- Netherlands
- Posts
- 10,287
So does the OP need to manually regex through all the documents, find and replace?? To take MS-generated stuff and turn it into UTF-8?
I wish I could regex my colleague : (
in the past I've dealt with this issue with the PHP function to convert those characters to their character entities. But that's clearly a less than ideal response to the problem - and it's one that reflects my lack of knowledge about how to do it right.
-
Jun 9, 2009, 01:21 #9
- Join Date
- Nov 2004
- Location
- Ankh-Morpork
- Posts
- 12,158
-
Jun 9, 2009, 04:11 #10
- Join Date
- Aug 2007
- Location
- Netherlands
- Posts
- 10,287
With so many customers out there sending things to companies in Windows-1252, I would be incredibly surprised to find that there isn't some tool out there that will let people (fairly easily) change improper characters to correct ones.
-
Jun 9, 2009, 11:36 #11
- Join Date
- Oct 2008
- Location
- United States
- Posts
- 33
FYI - I had a similar problem a few months ago and got some good advice here:
http://www.sitepoint.com/forums/show....php?p=4221647
One of the user contributed functions in the PHP manual seemed to convert very accurately:
http://php.net/manual/en/function.utf8-encode.php#45226
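For comparison, the general shape of such a conversion can be sketched in a few lines of Python. This is not the PHP function linked above, just an illustration of the idea: decode the stored octets as Windows-1252, then re-encode them as UTF-8. The helper name `cp1252_to_utf8` and the sample data are assumptions, and the "already UTF-8" guard is a heuristic to avoid converting the same data twice.

```python
def cp1252_to_utf8(raw: bytes) -> bytes:
    """Re-encode Windows-1252 octets as UTF-8, leaving valid UTF-8 untouched."""
    try:
        raw.decode("utf-8")
        return raw  # already valid UTF-8 (heuristic), so do not convert twice
    except UnicodeDecodeError:
        pass
    # Python's cp1252 codec raises on the few octets Windows-1252 leaves
    # undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D); real data may need a policy
    # for those.
    return raw.decode("cp1252").encode("utf-8")


# Hypothetical stored value pasted from a word processor.
stored = b"\x93smart quotes\x94 and an ellipsis\x85"
converted = cp1252_to_utf8(stored)

print(converted.decode("utf-8"))
```

Running the function a second time on its own output returns the bytes unchanged, which makes it reasonably safe to apply to a column that holds a mix of converted and unconverted rows.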
-
Jun 9, 2009, 17:25 #12
- Join Date
- Jan 2001
- Location
- Ottawa ON
- Posts
- 315
Thank you for your suggestions, everyone.
I have gotten my RSS feed (and site) working by specifying the character encoding as windows-1252.
And I feel like I've learned a bit more about character encoding in the process.