  1. #1
    I meant that to happen silver trophybronze trophy Raffles's Avatar
    Join Date
    Sep 2005
    Location
    Tanzania
    Posts
    4,662
    Mentioned
    2 Post(s)
    Tagged
    0 Thread(s)

    UTF-8 vs. ISO-8859-1

    I have been led to believe that UTF-8 is a larger character set than ISO-8859-1 and that it is increasing in popularity and use throughout the internet.

    However, I have noticed that if the page is served as UTF-8, the pound sign shows up as the error character �. Now, I know that to display it the HTML entity &pound; ought to be used. However, if UTF-8 can't display it without &pound; and ISO-8859-1 can, why is everyone moving towards UTF-8? What's wrong with good old ISO-8859-1?
    Last edited by Raffles; Sep 24, 2006 at 23:08.

  2. #2
    SitePoint Author silver trophybronze trophy

    Join Date
    Nov 2004
    Location
    Ankh-Morpork
    Posts
    12,158
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    The character encoding that you specify for a web page must match the encoding you used when saving your file. If you save your file as ISO-8859-1 and declare the encoding as UTF-8 (or vice versa) there'll be problems if you use characters outside the ASCII range.

    ISO-8859-1 is both a character repertoire ('character set') and an encoding. It's a straight single byte one-to-one encoding, which means it contains 256 positions (0x00-0xFF). Quite a few of those are reserved for control characters (C0 in 0x00-0x1F and C1 in 0x80-0x9F). That leaves 192 printable characters, which is enough for simple texts in most Western European languages. Unfortunately, ISO-8859-1 doesn't include some very useful and common characters, like proper quotation marks and dashes. It also doesn't contain the Euro currency character (€). (ISO-8859-15 is meant to replace ISO-8859-1, and contains the Euro sign.)

    UTF-8 is an encoding for the Unicode character repertoire. It uses between one and four bytes to encode each character (the original design allowed up to six, but RFC 3629 restricts it to four) and can thus represent any Unicode character. The first 128 characters (0x00-0x7F) are encoded identically to ASCII and ISO-8859-1.
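
    To make the variable-length idea concrete, here is a minimal Python sketch. The sample characters and the helper name utf8_length are my own choices for illustration; the samples need one, two, three, and four bytes respectively.

```python
# Sample characters from different Unicode ranges (chosen for illustration).
samples = {
    "A": "U+0041, ASCII letter",
    "£": "U+00A3, pound sign",
    "€": "U+20AC, euro sign",
    "𝄞": "U+1D11E, musical G clef",
}


def utf8_length(ch: str) -> int:
    """Return how many bytes UTF-8 uses to encode a single character."""
    return len(ch.encode("utf-8"))


for ch, description in samples.items():
    encoded = ch.encode("utf-8")
    print(f"{description}: {utf8_length(ch)} byte(s), hex {encoded.hex(' ')}")

# Characters in the ASCII range produce the same single byte in both encodings.
assert "A".encode("utf-8") == "A".encode("iso-8859-1")
```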

    The character repertoire used in HTML is ISO-10646, which is virtually the same as Unicode. Both UTF-8 and ISO-8859-1 (and many others) can be used as the encoding, but ISO-8859-1 is much more limited since it can only represent the first 256 characters (of which only 192 are printable).

    If you want to include a character that cannot be represented in your chosen encoding, you can use character entities (e.g., &pound;) or numeric character references (&#163; or &#xA3;).
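
    As a rough illustration of that fallback, the Python sketch below encodes text as ISO-8859-1 and swaps anything the encoding cannot represent for a numeric character reference. The function name to_latin1_html is hypothetical; the "xmlcharrefreplace" error handler and html.unescape are part of the standard library.

```python
import html


def to_latin1_html(text: str) -> bytes:
    """Encode text as ISO-8859-1, using numeric character references
    for characters that the encoding cannot represent."""
    return text.encode("iso-8859-1", errors="xmlcharrefreplace")


# The named entity and both numeric forms all stand for the same pound sign.
assert html.unescape("&pound;") == "£"
assert html.unescape("&#163;") == "£"
assert html.unescape("&#xA3;") == "£"

# The pound sign fits in ISO-8859-1; the euro sign does not and becomes a reference.
print(to_latin1_html("Price: £5 or €6"))
```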

    The pound sign is encoded differently in ISO-8859-1 (one byte, 0xA3) and UTF-8 (two bytes, 0xC2 0xA3). If you include a literal £ sign, and your declared encoding doesn't match the encoding in which you saved your file, the pound sign will not display correctly.
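
    The mismatch is easy to reproduce. This is only a sketch in Python (the variable names are mine), where decoding stands in for what the browser does with the declared encoding:

```python
pound = "£"

latin1_bytes = pound.encode("iso-8859-1")  # one byte: A3
utf8_bytes = pound.encode("utf-8")         # two bytes: C2 A3

# Saved as ISO-8859-1 but declared as UTF-8: a lone A3 byte is not valid
# UTF-8, so the decoder substitutes the replacement character U+FFFD.
saved_latin1_read_as_utf8 = latin1_bytes.decode("utf-8", errors="replace")

# Saved as UTF-8 but declared as ISO-8859-1: each byte becomes its own
# character, so an extra "Â" shows up in front of the pound sign.
saved_utf8_read_as_latin1 = utf8_bytes.decode("iso-8859-1")

print(repr(saved_latin1_read_as_utf8))
print(repr(saved_utf8_read_as_latin1))
```

    The first case is the error character from the original question; the second is the other common symptom of the same mistake.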

    If you want to use UTF-8, you must save your file with an encoding of UTF-8 and declare the encoding to be UTF-8. The encoding can be declared using the charset parameter of the Content-Type HTTP header, e.g.:
    Code:
    Content-Type: text/html; charset=utf-8
    If the encoding is not specified in the HTTP header, you can specify it using a META element:
    HTML Code:
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    Such a META element will be ignored if the encoding is sent in the real HTTP headers, but it can be useful when the document is saved to disk and viewed locally.
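
    The precedence described above (HTTP header first, META element as a fallback) can be sketched in Python as follows. The function names and the simple regular expression are assumptions made for illustration; real browsers use a considerably more involved detection algorithm.

```python
import re
from email.message import Message

# Simplified pattern for a charset declared inside a META element (illustrative only).
META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE)


def charset_from_header(content_type):
    """Extract the charset parameter from a Content-Type header value, if any."""
    if not content_type:
        return None
    message = Message()
    message["Content-Type"] = content_type
    return message.get_content_charset()  # lower-cased, or None if absent


def declared_encoding(content_type, body: bytes):
    """Return the declared encoding: the HTTP header wins, META is the fallback."""
    header_charset = charset_from_header(content_type)
    if header_charset:
        return header_charset
    match = META_CHARSET.search(body[:1024])
    return match.group(1).decode("ascii").lower() if match else None


page = b'<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
print(declared_encoding("text/html; charset=ISO-8859-1", page))  # header wins
print(declared_encoding(None, page))                             # META fallback
```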

    For (real) XHTML, the encoding should be specified in the XML declaration (and omitted from the HTTP header):
    Code:
    <?xml version="1.0" encoding="utf-8"?>
    This will only be applied if the document is served with an XML MIME type (preferably application/xhtml+xml). In that case, any META equivalent will be ignored.

    XML parsers are only required to support UTF-8 and UTF-16. XML parsers used in web browsers are likely to support the same range of encodings as the accompanying HTML parsers, but if you want to be on the safe side you should only use UTF-8 or UTF-16 for XML (including XHTML).
    Birnam wood is come to Dunsinane

  3. #3
    I meant that to happen silver trophybronze trophy Raffles's Avatar
    Join Date
    Sep 2005
    Location
    Tanzania
    Posts
    4,662
    Mentioned
    2 Post(s)
    Tagged
    0 Thread(s)
    That was wonderfully helpful, thank you very much. It did not occur to me to think about what I was saving things as.

    How can I find out what existing files are saved as? I've opened them up in my text editor (Notepad2) and it says the encoding is "ANSI", but I think that is probably the default.

  4. #4
    SitePoint Author silver trophybronze trophy

    Join Date
    Nov 2004
    Location
    Ankh-Morpork
    Posts
    12,158
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    I haven't used Notepad2, but I'd guess that it uses ISO-8859-1 (or Windows-1252) as the default. Windows-1252 is a Microsoft-specific version of ISO-8859-1 that uses the range reserved for C1 control characters (0x80-0x9F) for some useful characters (nice quotation marks, dashes, etc.).

    You'll need to look at the file using a tool that can show you the exact byte values. I use the Vim editor, which can do this, but you could also use any hex dump utility. If you know any programming language, it would be a trivial exercise to write a program that displays the character codes.

    Look at the code for the pound sign. If it's 163 (decimal) or A3 (hex), then you're probably using ISO-8859-1, although it could also be Windows-1252. If it's two bytes (C2 A3), you're using UTF-8.
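
    A program of that kind might look like the Python sketch below. The function names are hypothetical, and the check is only the pound-sign heuristic described above: a byte A3 can also occur inside other UTF-8 sequences (for example C3 A3), so treat the result as a hint rather than proof.

```python
def hex_dump(data: bytes) -> str:
    """Return the bytes as space-separated upper-case hex pairs."""
    return data.hex(" ").upper()


def guess_from_pound(data: bytes) -> str:
    """Guess the encoding from how the pound sign is stored.

    The two-byte form is tested first because it also contains the byte A3.
    """
    if b"\xc2\xa3" in data:
        return "utf-8"
    if b"\xa3" in data:
        return "iso-8859-1 or windows-1252"
    return "unknown (no pound sign found)"


for sample in ("£5".encode("utf-8"), "£5".encode("iso-8859-1")):
    print(hex_dump(sample), "->", guess_from_pound(sample))
```

    To check a real file, read it in binary mode (open(path, "rb").read()) and pass the bytes to the same functions.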

  5. #5
    bronze trophy
    Join Date
    Dec 2004
    Location
    Sweden
    Posts
    2,670
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    In Notepad2 "ANSI" is windows-1252, "Unicode" is utf-16le, "Unicode Big Endian" is utf-16be, "UTF-8" is utf-8 and "UTF-8 with Signature" is utf-8 with BOM.

    Copy the contents of the file, select File > Encoding > UTF-8, then paste and save. Done.

    When you open a file that contains only ASCII bytes, Notepad2 will assume windows-1252 (known bug; there should be a pref or it should honor the default). To prevent this, include the BOM or a non-ASCII character somewhere.

    If you have many files perhaps you want to automate the conversion (using iconv for instance).
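
    For a batch conversion without iconv, a minimal Python sketch could look like this. It assumes every matching file really is windows-1252 and rewrites files in place, so work on a backup copy; running it twice on the same files would corrupt non-ASCII characters. The function names and the "*.html" pattern are placeholders.

```python
from pathlib import Path


def convert_to_utf8(path: Path, source_encoding: str = "windows-1252") -> None:
    """Re-encode one file in place from source_encoding to UTF-8 (no BOM).

    Reading and writing bytes avoids any newline translation. Decoding raises
    UnicodeDecodeError if the file contains bytes undefined in source_encoding.
    """
    text = path.read_bytes().decode(source_encoding)
    path.write_bytes(text.encode("utf-8"))


def convert_tree(root: Path, pattern: str = "*.html") -> int:
    """Convert every file under root that matches pattern; return the count."""
    count = 0
    for path in sorted(root.rglob(pattern)):
        convert_to_utf8(path)
        count += 1
    return count
```

    Calling convert_tree(Path("site")) would then convert every .html file under a (hypothetical) "site" directory.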
    Simon Pieters

  6. #6
    I meant that to happen silver trophybronze trophy Raffles's Avatar
    Join Date
    Sep 2005
    Location
    Tanzania
    Posts
    4,662
    Mentioned
    2 Post(s)
    Tagged
    0 Thread(s)
    Ah, that's wonderful, PHP has the iconv library preinstalled. I'll write a script (I have 73 files that need converting).

    Thanks again for your help, I've learned quite a few new things.

