  1. #1
    SitePoint Zealot janislanka's Avatar
    Join Date
    Sep 2004
    Location
    Lithuania
    Posts
    191
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    which charset to use

    there are so many charsets I have been dealing with:
    charset=windows-1257
    charset=iso-8859-13
    charset=utf-8


    but I still don't understand what the difference might be. OK, better question: WHICH one is better to use?

    Janis

  2. #2
    Non-Member Egor's Avatar
    Join Date
    Jan 2004
    Location
    Melbourne, Australia
    Posts
    7,305
    Mentioned
    1 Post(s)
    Tagged
    1 Thread(s)
    I use charset=utf-8. Never had problems.

  3. #3
    SitePoint Zealot janislanka's Avatar
    Join Date
    Sep 2004
    Location
    Lithuania
    Posts
    191
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    OK, some languages that I might be using:
    Latvian, Lithuanian, Russian, English.

    English and Lithuanian are used the most. Often we have text in MS Word that we have to put into Dreamweaver, and every time I have to do some odd conversions that I think should be avoidable.

    Also, what are their differences? Who am I losing if I use UTF or ISO?

  4. #4
    The CSS Clinic is open silver trophybronze trophy
    Paul O'B's Avatar
    Join Date
    Jan 2003
    Location
    Hampshire UK
    Posts
    40,556
    Mentioned
    183 Post(s)
    Tagged
    6 Thread(s)
    Did you check the link in the answer I posted in this thread?:

    http://www.sitepoint.com/forums/showthread.php?t=269747

    And some more info here:

    http://www.w3.org/TR/REC-html40/charset.html
    http://www.w3.org/International/O-charset.html
    http://www.unicode.org/iuc/iuc10/languages.html

    Looks like utf-8 is the one you should be using

  5. #5
    bronze trophy
    Join Date
    Dec 2004
    Location
    Sweden
    Posts
    2,670
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    If you are going to use different languages then UTF-8 is probably better. The ISO encodings are single byte, so you can only use a limited number of characters. UTF-8 is multibyte, which means that you can use any character you want, without having to use character references or entities.
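To make the single-byte limitation concrete, here is a minimal Python sketch (Python is used only for illustration; the sample characters and the `encodable` helper are my own assumptions, not something from the posts above). It checks whether one character from each language mentioned in this thread can be encoded in each of the three charsets from the first post.

```python
# One sample character per language mentioned in the thread (illustrative picks).
samples = {
    "Lithuanian": "\u0117",  # e with dot above
    "Latvian": "\u0101",     # a with macron
    "Russian": "\u044f",     # Cyrillic ya
    "English": "a",
}

def encodable(text, charset):
    """Return True if every character in text can be encoded in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False

# Map each charset to a per-language True/False result.
support = {
    charset: {lang: encodable(ch, charset) for lang, ch in samples.items()}
    for charset in ("windows-1257", "iso-8859-13", "utf-8")
}

for charset, result in support.items():
    print(charset, result)
```

With these samples, the two Baltic single-byte charsets handle the Lithuanian and Latvian letters but not the Cyrillic one, while UTF-8 handles all four.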
    Simon Pieters

  6. #6
    SitePoint Zealot janislanka's Avatar
    Join Date
    Sep 2004
    Location
    Lithuania
    Posts
    191
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    OK, so UTF-8 seems to be the better choice at the moment. But everywhere I read about charsets, I can't find the problems that UTF-8 might have. I mean, if it's that good, why isn't it used more? What are the drawbacks that I need to know about?

  7. #7
    SitePoint Zealot Kaystarmaker's Avatar
    Join Date
    Jan 2005
    Location
    The Netherlands
    Posts
    183
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Most sites are in English, so they can just use ISO. On top of that, most WYSIWYG editors use ISO (for whatever reason), so a lot of people don't pay attention to charsets and just use what they are given. I don't think UTF-8 has any drawbacks... files might become slightly bigger because some letters take more than one byte?

  8. #8
    bronze trophy
    Join Date
    Dec 2004
    Location
    Sweden
    Posts
    2,670
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Since utf-8 is multibyte, it is harder to programmatically count "characters", because some characters take up more than one byte and the byte length of a string no longer equals its character count. I don't think this is a problem, but it's something to be aware of.
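A small sketch of the counting issue, in Python for illustration (the sample word is my own choice). It compares the number of characters in a string with the number of bytes it occupies once encoded as UTF-8.

```python
# "aciu" with Lithuanian diacritics (thank you), written with escapes
# so the example does not depend on how this file itself is encoded.
text = "a\u010di\u016b"

char_count = len(text)                  # counts characters (code points)
byte_count = len(text.encode("utf-8"))  # counts bytes after UTF-8 encoding

print(char_count, byte_count)
```

The two accented letters take two bytes each in UTF-8, so any code that measures length in bytes (for example, to enforce a field limit) will see a larger number than the character count.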

    If you use multiple languages, the files may get smaller with utf-8 than with a single byte encoding. If you want to use characters that are outside the range of the single byte encoding, you have to escape them with character references. For instance, the ☺ 'WHITE SMILING FACE' (U+263A) character takes up three bytes in utf-8, but the character reference "&amp;#9786;" that you would otherwise type (an ampersand followed by #9786;) takes seven bytes. There's no need to escape characters in utf-8, except for the markup-significant characters of course (< > & " ).
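The byte counts above can be checked with a short Python sketch (illustrative only). It encodes the smiley directly as UTF-8, and then encodes it into ISO-8859-13 using Python's `xmlcharrefreplace` error handler, which substitutes a numeric character reference for anything the target charset cannot represent.

```python
smiley = "\u263a"  # WHITE SMILING FACE

# Encoded directly as UTF-8.
utf8_bytes = smiley.encode("utf-8")

# Encoded into a single-byte charset; the unsupported character is
# replaced by its numeric character reference.
escaped_bytes = smiley.encode("iso-8859-13", "xmlcharrefreplace")

print(len(utf8_bytes), utf8_bytes)
print(len(escaped_bytes), escaped_bytes)
```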

    Another thing to be aware of is sending e-mails through a form. Mail headers are ASCII-only, so if the "To", "From" or "Subject" headers contain non-ASCII text in utf-8, that text must be encoded as a MIME encoded-word, for example with base64. In PHP, it can be done like this:
    PHP Code:
    $subject = "=?UTF-8?B?" . base64_encode($subject) . "?=";
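For comparison, here is the same idea as a minimal Python sketch. The `encode_subject` helper and the sample subject are hypothetical names of my own; the sketch builds the encoded-word by hand, the same way the PHP line does, and then uses the standard library's `decode_header` to confirm it round-trips.

```python
import base64
from email.header import decode_header

def encode_subject(subject):
    """Wrap a subject line in a base64 MIME encoded-word, as the PHP line does."""
    b64 = base64.b64encode(subject.encode("utf-8")).decode("ascii")
    return "=?UTF-8?B?" + b64 + "?="

original = "A\u010di\u016b"  # sample subject with non-ASCII letters
header = encode_subject(original)

# Round-trip check: decode the header back to text.
decoded_bytes, charset = decode_header(header)[0]
decoded = decoded_bytes.decode(charset)

print(header)
print(decoded == original)
```

Note that this simple form does not split long subjects into several encoded-words, which the mail standards expect for long header values, so treat it as a starting point rather than a complete solution.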
    Simon Pieters

  9. #9
    The CSS Clinic is open silver trophybronze trophy
    Paul O'B's Avatar
    Join Date
    Jan 2003
    Location
    Hampshire UK
    Posts
    40,556
    Mentioned
    183 Post(s)
    Tagged
    6 Thread(s)
    Here are a couple more useful links.

    http://www.joelonsoftware.com/articles/Unicode.html
    http://annevankesteren.nl/2004/06/utf-8

    But I think zcorpan's posts are just as useful

