  1. #1
    It's all Geek to me silver trophybronze trophy
    ralph.m's Avatar
    Join Date
    Mar 2009
    Location
    Melbourne, AU
    Posts
    24,112
    Mentioned
    448 Post(s)
    Tagged
    8 Thread(s)

    Character encoding question

    I'm a little confused by the whole topic of character encoding. As I understand it, the encoding is normally sent by the web server, and the encoding specified in the meta tag is only really useful if the page is viewed offline (or if the server does not specify an encoding).

    I'm not sure how to change the server encoding (I'm still learning about that) but I'm pretty sure my server is serving my pages as UTF-8 (going on the info supplied by Mozilla Web-sniffer).

    I thought that you really don't need to use character or entity references if you are using UTF-8, but I find that some characters still display as a ? if I don't use a character or entity reference. For example, I need to use them for curly quotes. Does this mean that my pages are not actually being sent as UTF-8, or am I missing something?

    (PS If you know a way of checking the true page encoding and want to test one of my pages, try the links in my signature below.)
    Facebook | Google+ | Twitter | Web Design Tips | Free Contact Form

    Forum Usage: Tips on posting code samples, images and more

    Forrest Gump: "IE is like a box of chocolates: you never know what you're gonna get."

  2. #2
    SitePoint Author Kevin Yank's Avatar
    Join Date
    Apr 2000
    Location
    Melbourne, Australia
    Posts
    2,571
    Mentioned
    2 Post(s)
    Tagged
    0 Thread(s)
    It sounds like your text editor is to blame. You’re correct that if your page is being served as UTF-8, it should be able to contain just about any character without the need for character entity references, but this will only work if your text editor is encoding the file as UTF-8 when you save it. If, however, your text editor is saving the file as Windows-1252 (the default for most text editors), those characters will not be decoded correctly by the browser when it tries to read them with the UTF-8 encoding.
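To make the mismatch concrete, here is a minimal Python sketch (not from the original post) of what happens to curly quotes saved as Windows-1252 when they are later read as UTF-8. The sample string is an assumption chosen for illustration, and `errors="replace"` stands in for how a browser typically substitutes a replacement character for undecodable bytes.

```python
# Curly quotes as they might appear in a page's source text (sample text is an assumption).
text = "\u201cHello\u201d"

# The bytes an editor writes if it saves the file as Windows-1252.
cp1252_bytes = text.encode("cp1252")

# The bytes an editor writes if it saves the file as UTF-8.
utf8_bytes = text.encode("utf-8")

# A reader that has been told "charset=utf-8" decodes both files as UTF-8.
# The Windows-1252 quote bytes (0x93, 0x94) are not valid UTF-8 on their own,
# so each one is replaced with U+FFFD, the replacement character.
misread = cp1252_bytes.decode("utf-8", errors="replace")
correct = utf8_bytes.decode("utf-8")

print(repr(misread))
print(repr(correct))
```

Only the file that was actually saved as UTF-8 round-trips cleanly; the other shows placeholder characters where the quotes should be.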

    Most text editors let you set the encoding to use in the Save As dialog, and you can usually also change the default encoding to UTF-8.

    As for converting the files you have already created, advanced programming editors will let you open a Windows-1252 file using that encoding and then save it as UTF-8, which will convert the characters in the process.
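If no suitable editor is at hand, the same conversion can be sketched in a few lines of Python. This is only an illustration under the assumption that the file really is Windows-1252; the function name `convert_to_utf8` is hypothetical, and it overwrites the file in place, so work on a copy.

```python
from pathlib import Path


def convert_to_utf8(path, source_encoding="cp1252"):
    """Re-save a text file as UTF-8, assuming it is currently in source_encoding."""
    file_path = Path(path)
    # Decode the raw bytes using the encoding the file was actually saved in.
    text = file_path.read_bytes().decode(source_encoding)
    # Write the same characters back out as UTF-8 bytes (overwrites the file).
    file_path.write_bytes(text.encode("utf-8"))
```

Reading and writing bytes directly avoids any newline translation, so only the character encoding changes.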
    Kevin Yank
    CTO, sitepoint.com
    I wrote: Simply JavaScript | BYO PHP/MySQL | Tech Times | Editize
    Baby’s got back—a hard back, that is: The Ultimate CSS Reference

  3. #3
    It's all Geek to me silver trophybronze trophy
    ralph.m's Avatar
    Thanks for your reply, Kevin. My text editor is Dreamweaver (I only use code view... it's basically an uploading mule!) and all pages are set to Unicode 5.0 UTF-8, so I'm pretty sure the pages are being saved as that. I don't mind using character references, but I recently read that I shouldn't have to, hence my interest in getting to the bottom of this.

  4. #4
    SitePoint Author silver trophybronze trophy

    Join Date
    Nov 2004
    Location
    Ankh-Morpork
    Posts
    12,158
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by ralph.m View Post
    I'm a little confused by the whole topic of character encoding. As I understand it, the encoding is normally sent by the web server, and the encoding specified in the meta tag is only really useful if the page is viewed offline (or if the server does not specify an encoding).
    That is correct.

    Quote Originally Posted by ralph.m View Post
    I'm not sure how to change the server encoding
    That depends on which web server you are using, of course. In Apache you can use directives like AddDefaultCharset (in the global httpd.conf file or in local .htaccess files), for instance.

    If you're using a server-side scripting/programming language, you can also send the Content-Type HTTP header including a charset attribute that way. For instance, using PHP,
    Code PHP:
    header('Content-Type: text/html; charset=utf-8');

    Quote Originally Posted by ralph.m View Post
    I'm pretty sure my server is serving my pages as UTF-8 (going on the info supplied by Mozilla Web-sniffer).
    Opera's Info panel will also show you the encoding sent by the server for any page.
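Whatever tool reports it, the encoding in question is the charset parameter of the Content-Type response header. As a rough sketch, Python's standard library can pull that parameter out of a header value; the function name `charset_from_content_type` is hypothetical, and the header strings in the usage note are examples, not output from any real server.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset declared in a Content-Type header value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the charset parameter in lower case,
    # or None when the header does not declare one.
    return msg.get_content_charset()
```

For example, `charset_from_content_type("text/html; charset=UTF-8")` gives `"utf-8"`, while a bare `"text/html"` gives `None`, which is the case where the meta tag becomes relevant.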

    Quote Originally Posted by ralph.m View Post
    I thought that you really don't need to use character or entity references if you are using UTF-8, but I find that some characters still display as a ? if I don't use a character or entity reference.
    You don't. UTF-8 is capable of encoding any literal Unicode character. If you get '?' characters in your output, you are probably not saving the source file as UTF-8.

    Quote Originally Posted by ralph.m View Post
    For example, I need to use them for curly quotes. Does this mean that my pages are not actually being sent as UTF-8, or am I missing something?
    It means you are probably not saving your pages as UTF-8. Does this happen only when you copy text with curly quotes from other apps (like MS Word), or also when you insert them directly in your editor?

    Quote Originally Posted by ralph.m View Post
    (PS If you know a way of checking the true page encoding and want to test one of my pages, try the links in my signature below.)
    They are served as UTF-8.
    Birnam wood is come to Dunsinane

  5. #5
    Follow: @AlexDawsonUK silver trophybronze trophy AlexDawson's Avatar
    Join Date
    Feb 2009
    Location
    England, UK
    Posts
    8,111
    Mentioned
    0 Post(s)
    Tagged
    1 Thread(s)
    To be fair though, Ralph, I think it's probably better to use character entity references whether you use Unicode or not. One issue I commonly see on websites that skip them is that translated pages can produce some seriously wonky results if the characters aren't properly declared. It's a convention that browsers tend to be uncompromising about, and you don't want to run the risk of someone overriding the encoding (as all browsers easily allow) and seeing your characters glitch as a result. Though perhaps it's just me being a bit stiff in my implementation; I tend to just use them without question!

  6. #6
    It's all Geek to me silver trophybronze trophy
    ralph.m's Avatar
    Thanks very much for your replies, Tommy and Alex.

    Quote Originally Posted by AutisticCuckoo View Post
    That depends on which web server you are using, of course.
    I'm using a VPS running on Apache. I haven't found a way to manipulate the settings yet, but am working on it. Anyway, it seems the pages are being sent as UTF-8 anyhow.

    Quote Originally Posted by AutisticCuckoo View Post
    If you're using a server-side scripting/programming language, you can also send the Content-Type HTTP header including a charset attribute that way. For instance, using PHP,
    Code PHP:
    header('Content-Type: text/html; charset=utf-8');
    That's a handy tip. I'm not exactly sure how to place that code, though (for future reference). Would it go above the page's HTML with PHP tags?

    Quote Originally Posted by AutisticCuckoo View Post
    Opera's Info panel will also show you the encoding sent by the server for any page.
    That sounds handy. I wasn't able to find the Opera 'Info panel'. Where is it located?

    Quote Originally Posted by AutisticCuckoo View Post
    If you get '?' characters in your output, you are probably not saving the source file as UTF-8... Does this happen only when you copy text with curly quotes from other apps (like MS Word), or also when you insert them directly in your editor?
    Oh dear, I tried to replicate this, but now I can't! But yes, it's very likely that this was the cause. Silly me. I was doing that a fair bit on a recent project.

    Quote Originally Posted by AlexDawson View Post
    To be fair though Ralph, I think it's probably better to use character entity references whether you use Unicode or not... I tend to just use them without question!
    Actually, that's been my attitude all along. But the other day I was reading Tommy's article where he pointed out (as I read it, anyhow) that it wasn't necessarily a worthwhile practice, which made me think again (of course, I can't find the statement now...). As you say, though, it does seem a bit safer. I'm also inclined to stick with them.

  7. #7
    SitePoint Author silver trophybronze trophy

    Quote Originally Posted by AlexDawson View Post
    I think it's probably better to use character entity references whether you use Unicode or not
    I strongly disagree! If you do that, you can just set your editor and web server to US ASCII and be done with it.

    The whole point of choosing a character encoding is so that you can avoid the bloated and error-prone entity references and NCRs.

    Although the issue may be rather minor for an English-speaking author, I can assure you that reading or writing text copy in Swedish is not fun when it looks like 'bl&aring;b&auml;rsgr&ouml;t'.
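To put a rough number on the bloat, here is a small Python sketch comparing the same Swedish word stored as plain UTF-8 and as US-ASCII with every non-ASCII letter turned into a numeric character reference. It relies on Python's built-in `xmlcharrefreplace` error handler, which emits decimal NCRs; the variable names are just for illustration.

```python
word = "bl\u00e5b\u00e4rsgr\u00f6t"  # the Swedish word from the post, with real letters

# Saved as UTF-8: each of the three non-ASCII letters takes two bytes.
as_utf8 = word.encode("utf-8")

# Saved as US-ASCII: every non-ASCII letter must become an NCR.
as_ascii_with_ncrs = word.encode("ascii", errors="xmlcharrefreplace")

print(len(as_utf8), as_utf8)
print(len(as_ascii_with_ncrs), as_ascii_with_ncrs)
```

The ASCII version is nearly twice as long for this word, and it is the one an author has to read and edit in the source.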

    There are some special characters where I advocate using a NCR (or an entity reference), viz. those that are not printable and/or are hard to distinguish from other characters. Examples: soft-hyphen (U+00AD) and non-break space (U+00A0). I'll write those as &#173; and &#160;, respectively. Using &shy; and &nbsp; should also be fairly safe (unless you use XHTML).
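That selective approach could be sketched in Python roughly as follows: only the two invisible characters are rewritten as NCRs, and every other character is left as a literal. The mapping `INVISIBLE_NCRS` and the function `escape_invisible` are hypothetical names, and the list of characters is limited to the two examples given here.

```python
# Characters that are invisible or easy to confuse in source code,
# mapped to their decimal numeric character references.
INVISIBLE_NCRS = {
    "\u00ad": "&#173;",  # soft hyphen
    "\u00a0": "&#160;",  # non-break space
}


def escape_invisible(text):
    """Replace only soft hyphens and non-break spaces with NCRs."""
    return "".join(INVISIBLE_NCRS.get(ch, ch) for ch in text)
```

Ordinary non-ASCII letters such as å pass through untouched, so the source stays readable while the invisible characters become visible.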

    Quote Originally Posted by ralph.m View Post
    That's a handy tip. I'm not exactly sure how to place that code, though (for future reference). Would it go above the page's HTML with PHP tags?
    Since this is sent as an HTTP response header, it must precede any content written to the response body. Otherwise you'll get a PHP error message. If you are unable to place the function call at the top of the page, for some reason, you can use output buffering to circumvent the problem.

    Quote Originally Posted by ralph.m View Post
    That sounds handy. I wasn't able to find the Opera 'Info panel'. Where is it located?
    It's not enabled by default, for some reason. Press Shift+F12, select the Panels tab and check 'Info'. It will then become available with the other panels (at the left edge of the window, by default).

  8. #8
    It's all Geek to me silver trophybronze trophy
    ralph.m's Avatar
    Thanks again Tommy. Interesting thoughts.

    Quote Originally Posted by AutisticCuckoo View Post
    The whole point of choosing a character encoding is so that you can avoid the bloated and error-prone entity references and NCRs... I can assure you that reading or writing text copy in Swedish is not fun when it looks like 'bl&aring;b&auml;rsgr&ouml;t'.
    I see your point.

    Since this is sent as an HTTP response header, it must precede any content written to the response body. Otherwise you'll get a PHP error message.
    Thanks for that. I don't know if I'll ever need to try it, but useful to know.

    Press Shift+F12, select the Panels tab and check 'Info'. It will then become available with the other panels....
    Seems to be a bit different on the Mac, but I got there in the end.
    View > Toolbars > Panels > i ...
    A handy little feature.

