  1. #1
    SitePoint Zealot Michel Merlin's Avatar
    Join Date
    Mar 2005
    Location
    Versailles (France)
    Posts
    169
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    The Definitive Guide to Web Character Encoding

    Notice: This is a discussion thread for comments about the SitePoint article, The Definitive Guide to Web Character Encoding.
    __________

    Thanks for this very good article.

    However, like many US-English speakers, the author overlooks that UTF-8 actually misses its goal and *in fact* doesn't work. When you write in UTF-8, your text goes through in perfect form as long as you don't use any character outside the first 128 ASCII code points (0 to 127), or as long as the recipient reads it in its original document. But as soon as he uses it elsewhere, e.g. by Replying or Forwarding, each European accented character will cripple 2 or 3 characters around it, making the document unusable.

    Sure, this will eventually get fixed, but so far, if you want to write European languages properly, you write in ISO 8859-1. Its only real-world shortcoming is the lack of the Euro *typographical* symbol, which you can appropriately replace with the Euro *financial* symbol, EUR, which is more widely and officially standardized, and actually read and understood by any person or program in any country in the world, from Thailand to USA to Germany.

    I gave more details in newsgroup MS Public Outlook Express General, e.g. in message « For Long URLs, Accentuated Chars, encode as Quoted-Printable, Western European (ISO), use "EUR" for Euro symbol » posted Sun 19 Nov 2006 18:56:45 GMT.

    Versailles, Wed 10 Jan 2007 10:57:55 +0100, edited 11:06:50

  2. #2
    SitePoint Zealot Michel Merlin's Avatar
    Join Date
    Mar 2005
    Location
    Versailles (France)
    Posts
    169
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    My news links are clickable (follow "View all comments" and see my edited message).

    Versailles, Wed 10 Jan 2007 11:10:10 +0100

  3. #3
    SitePoint Author silver trophybronze trophy

    Join Date
    Nov 2004
    Location
    Ankh-Morpork
    Posts
    12,158
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Michel, what you are describing is not a shortcoming of UTF-8 in any way. It is the problem that arises when components in a communications chain use different encodings.

    Forwarding a Polish phrase from a site encoded in Windows-1250 will also result in gibberish if the recipient uses ISO 8859-1.

    So I will respectfully disagree with you when you say that UTF-8 "doesn't work". It works very well, but of course there will be problems if someone uses text encoded as UTF-8 and declares the encoding as ISO 8859-1. The opposite is also true, so using the same argument one could say that ISO 8859-1 "doesn't work". In fact, there is not a single character encoding that "does work" under those premises. :)
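
    The mismatch described above can be reproduced in a few lines. This is a minimal Python sketch; the sample strings and variable names are illustrative assumptions, not taken from the thread. The same bytes are decoded once with the wrong encoding to show what the recipient would see.

```python
# Text written as UTF-8 but read as ISO 8859-1: each accented character
# turns into two wrong characters, while plain ASCII survives untouched.
original = "café"
utf8_bytes = original.encode("utf-8")
misread = utf8_bytes.decode("iso-8859-1")
print(misread)  # cafÃ©

# The reverse mismatch: ISO 8859-1 bytes are not valid UTF-8 here,
# so a strict UTF-8 decoder rejects them.
latin1_bytes = original.encode("iso-8859-1")
try:
    latin1_bytes.decode("utf-8")
    reverse_ok = True
except UnicodeDecodeError:
    reverse_ok = False
print(reverse_ok)  # False

# A Polish word saved as Windows-1250 and read as ISO 8859-1
# (the word itself is a hypothetical example).
polish = "żółw"
polish_misread = polish.encode("cp1250").decode("iso-8859-1")
print(polish_misread)
```

    In each case the bytes are intact; only the label attached to them is wrong.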

  4. #4
    SitePoint Member bhutz's Avatar
    Join Date
    May 2004
    Location
    Bedford
    Posts
    13
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Thanks Tommy, great article!

  5. #5
    Johan De Silva
    SitePoint Community Guest
    I receive all of my web copy in Word and copy and paste it into a CMS or via Dreamweaver. I presume using Windows-1252 would be the best for this?

  6. #6
    SitePoint Zealot Michel Merlin's Avatar
    Join Date
    Mar 2005
    Location
    Versailles (France)
    Posts
    169
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    AutisticCuckoo, please read the message I linked (and the analogous ones in this newsgroup and in similar ones such as "IE General" and "Windows XP General"), make sure you are using the appropriate settings for writing *AND FOR READING* in Internet Explorer *AND IN OE*, send yourself a message with European characters, and try to edit it (by Replying or Forwarding). You will see that, even if the receiving side is properly set to write in UTF-8 and to read in "Auto-Select", your UTF-8-encoded forwarded text will have 2 or 3 characters crippled around every European accented character. (You may have to save the message and reopen it to see the problems. Also, this was in IE6; so far I haven't re-checked with IE7.)

    Where I agree is that, as you seem to imply, the flaw is IMO not necessarily in the UTF-8 encoding itself, but may lie more in the insufficient efforts to make UTF-8 easy to understand and apply, and accordingly widely adopted, in particular inside IE and OE themselves.

    Versailles, Wed 10 Jan 2007 15:48:35 +0100, edited (added links to NGs) 15:54:40

  7. #7
    SitePoint Author silver trophybronze trophy

    Join Date
    Nov 2004
    Location
    Ankh-Morpork
    Posts
    12,158
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    I don't use Internet Explorer (I use Linux) and the links you posted don't work for me (Opera).

    Without reading the articles, it seems to me that any problems exist because Microsoft doesn't handle UTF-8 correctly in its applications. That's not a fault within the encoding, it's a fault in Microsoft's software. Without waxing philosophical, how long are we going to let a big corporation with a poor QA department hold back development?
    Birnam wood is come to Dunsinane

  8. #8
    SitePoint Zealot Michel Merlin's Avatar
    Join Date
    Mar 2005
    Location
    Versailles (France)
    Posts
    169
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Unfortunately you are right about MS (unduly) not applying in its products what it (duly) requires from others. But the fact is that, in the real world (where a majority use IE and OE) and at the user level, messages sent in UTF-8 don't work properly, and American users (who generally are more careful about writing properly) tend to write European texts in UTF-8, apparently unaware that in fact UTF-8 handles accented characters improperly, i.e. misses its main goal.

    Versailles, Wed 10 Jan 2007 16:15:25 +0100

  9. #9
    SitePoint Zealot Michel Merlin's Avatar
    Join Date
    Mar 2005
    Location
    Versailles (France)
    Posts
    169
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    To open the NEWS links above:

    Clicking them should open them in any properly configured browser+newsreader (IE+OE, FF+TB+NGs, etc). In case this fails, try the following:
    1. Right-click MS Public Outlook Express General, choose "Copy Shortcut", then left-click in the Address Bar of your preferred browser, press <Ctrl>V to paste the URL, and hit ENTER. This will open your News Reader (usually OE or TB) on the MS Public OE NG (Microsoft Public Outlook Express General); if necessary, you will then be prompted to create an account on the MS News server, in which case do accept. Such an account is free (good), and doesn't require a PWD (bad). This again should work in any properly configured browser+newsreader.
    2. Then try clicking my links. The one to the message (« For Long URLs, Accentuated Chars, encode as Quoted-Printable, Western European (ISO), use "EUR" for Euro symbol ») should open the said message in its own window; if it doesn't, then again, right-click, Copy Shortcut and paste in the browser Address Bar.
    Versailles, Wed 10 Jan 2007 16:39:45 +0100, edited (replacing URLs with complete links so to not break layout in some users' browsers) Thu 11 Jan 2007 15:38:25 +0100
    Last edited by Michel Merlin; Jan 11, 2007 at 08:38.

  10. #10
    The CSS Clinic is open silver trophybronze trophy
    Paul O'B's Avatar
    Join Date
    Jan 2003
    Location
    Hampshire UK
    Posts
    40,529
    Mentioned
    182 Post(s)
    Tagged
    6 Thread(s)
    Thanks for the article Tommy - very interesting and informative

  11. #11
    Robert Wellock silver trophybronze trophy xhtmlcoder's Avatar
    Join Date
    Apr 2002
    Location
    A Maze of Twisty Little Passages
    Posts
    6,316
    Mentioned
    60 Post(s)
    Tagged
    0 Thread(s)
    Unfortunately I have a firewall stopping me from viewing the newsgroup, so I cannot see what was written, though I'd suspect the said Microsoft applications "lack functionally". I would guess Mr T's article was focusing more on webpages.

    As for 'Sandwich Table', I don't know? You probably over-completed certain parts of the article or didn't put enough emphasis on the difference between HTML and XHTML (you briefly mentioned that the charset meta declaration is not recognized by XML processors), though not on how external CSS and such will get treated by the two different processors, i.e. @charset "utf-8";

    I think you should have also mentioned if you only declare via the META element it should be the first thing that appears after the opening HEAD tag.

    Other than that it more-or-less covered most things in a roundabout way.

  12. #12
    SitePoint Author silver trophybronze trophy

    Join Date
    Nov 2004
    Location
    Ankh-Morpork
    Posts
    12,158
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    There aren't too many differences between HTML and XHTML when it comes to character encoding. I mentioned that in real XHTML you should use the XML declaration, while in pretend-XHTML you could use a META element like the HTML it really is. In either case, a true HTTP header sent by the web server will override.
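
    The precedence mentioned above (a real HTTP header wins over an in-document declaration) can be sketched in Python as follows. The function names and the fallback value are hypothetical, chosen only for illustration; the header parsing relies on the standard library's `email.message.Message`.

```python
from email.message import Message


def charset_from_header(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lower-cased, or None if absent


def effective_charset(http_content_type, declared_in_document, fallback="iso-8859-1"):
    """Pick the encoding a client would use: HTTP header first, then the
    META element or XML declaration, then a fallback (an assumption here)."""
    from_http = charset_from_header(http_content_type)
    if from_http:
        return from_http
    if declared_in_document:
        return declared_in_document.lower()
    return fallback


print(effective_charset("text/html; charset=ISO-8859-1", "utf-8"))  # iso-8859-1
print(effective_charset("text/html", "UTF-8"))  # utf-8
```

    The first call shows why a page saved as UTF-8 can still be misread: the server's header takes priority over what the document says about itself.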

    You are correct about other external files, though. I did realise (too late) that I forgot to include information about encoding for CSS and JavaScript files. Presumably, people would use the same editor settings, though, so it should work out.

    The META element doesn't have to be the first thing after the <head> tag, but there should be no characters outside the US-ASCII range preceding it.
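
    A rough way to check that rule on a file is to look at the bytes that precede the META element. This is only a sketch in Python: the function name is hypothetical, and it assumes the first `<meta` in the document is the one carrying the charset.

```python
def only_ascii_before_meta(document_bytes):
    """True if every byte before the first <meta is in the US-ASCII range."""
    position = document_bytes.lower().find(b"<meta")
    if position == -1:
        return False  # no META element found at all
    return all(byte < 0x80 for byte in document_bytes[:position])


meta = b'<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
title = "<title>café</title>".encode("utf-8")

safe_page = b"<html><head>" + meta + title + b"</head>"
risky_page = b"<html><head>" + title + meta + b"</head>"

print(only_ascii_before_meta(safe_page))   # True
print(only_ascii_before_meta(risky_page))  # False
```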
    Birnam wood is come to Dunsinane

  13. #13
    SitePoint Enthusiast
    Join Date
    Jul 2006
    Posts
    38
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Thanks for the article Tommy. I enjoy reading your articles because they're informative, and your use of words is extremely precise.

    Anyways, I had a question regarding the display of numbers in a different language. I would like to display Arabic numbers on a page, which seems to work in IE7, but not in Firefox. The test code I used is:

    HTML Code:
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
    <html>
    <head>
    <title>Check</title>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    </head>
    <body>
    <p><span>من</span><span> 123</span></p>
    <p lang="ar"><span> 123</span></p>
    </body>
    </html>
    It would appear that numbers are automatically converted to Arabic when they exist within the same block as Arabic writing (e.g. first paragraph in the sample code), and when the lang attribute is set to ar (e.g. second paragraph in the sample code). This is for IE7.

    Is the issue here the support of fonts in the different browsers, or their support for the encoding?

  14. #14
    SitePoint Author silver trophybronze trophy

    Join Date
    Nov 2004
    Location
    Ankh-Morpork
    Posts
    12,158
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    It works as expected in Opera, Firefox and IE6. I don't have access to IE7. I don't see any numbers 'converted to Arabic', and there shouldn't be any. If you want Arabic numerals, you need to use the appropriate characters (U+0660 to U+0669).

    A browser converting '1' into '١' because the lang attribute is set to 'ar' is violating specifications.
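
    If those digits are wanted, the conversion has to happen in the content itself rather than in the browser. A minimal Python sketch, with a hypothetical helper name, that maps the ASCII digits 0 to 9 onto U+0660 to U+0669:

```python
ARABIC_INDIC_ZERO = 0x0660  # code point of the Arabic-Indic digit zero

# Translation table: ASCII digit code point -> Arabic-Indic digit character.
TO_ARABIC_INDIC = {ord("0") + i: chr(ARABIC_INDIC_ZERO + i) for i in range(10)}


def arabic_indic_digits(text):
    """Replace ASCII digits with Arabic-Indic digits; leave other text alone."""
    return text.translate(TO_ARABIC_INDIC)


print(arabic_indic_digits("123"))
```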

  15. #15
    Mondain
    SitePoint Community Guest
    Just to point out a little trivia: Arabic numerals are the ones that Americans use every day, 123456 and such. The numerals that most Arabs use are called Indic and originate from India.

  16. #16
    SitePoint Addict
    Join Date
    May 2006
    Location
    Amsterdam
    Posts
    206
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Currently I'm working on a project that involves Scandinavian, Eastern and Western European Countries.

    The Scandinavians use characters such as:
    å (U+00E5, &#229;)
    Ø (U+00D8, &#216;)
    Ä (U+00C4, &#196;)

    The Eastern Europeans use characters such as:
    Ł (U+0141, &#321;)
    Ś (U+015B, &#347;)
    ą (U+0105, &#261;)

    The Western Europeans use characters such as:
    é (U+00E9, &#233;)
    ñ (U+00F1, &#241;)
    ü (U+00FC, &#252;)


    In PHP there are two functions that I’ve used which help out when handling the character sets:
    1. htmlentities(). This function works well for interpreting user input and placing the appropriate character entity in the db. You can specify whether or not to convert quotes and which character set you wish to use: ISO-8859-1, UTF-8, etc.
    2. html_entity_decode() is the flip side of htmlentities(): it decodes the character entity received from the database, and takes the same quote-style and character-set arguments as htmlentities().
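
    The round trip through numeric character references described in this list can be illustrated in Python. This is an analogue, not the PHP functions themselves: `to_ncrs` and `from_ncrs` are hypothetical names, and unlike htmlentities() this sketch emits only decimal NCRs and does not touch quotes or markup characters.

```python
import html


def to_ncrs(text):
    """Replace every character outside US-ASCII with a decimal NCR."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


def from_ncrs(markup):
    """Turn character references back into the characters they stand for."""
    return html.unescape(markup)


stored = to_ncrs("Łódź")
print(stored)             # &#321;&#243;d&#378;
print(from_ncrs(stored))  # Łódź
```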


    When I originally set-up the PHP code I decided to use UTF-8 encoding thinking it would be the best way to handle the various characters but I ran into two problems.
    1. The first was that UTF-8 didn’t handle all of the characters I needed one of which is the €.
    2. The second is a PHP bug when using html_entity_decode with UTF-8 specified as the character set. The report says that the status is ‘won’t fix’. Apparently this is a pre-PHP 5 issue. The solution uses PHP 4.3.11 - migrating to PHP 5 is not in the scope of the project.


    With the help of this article I realized that ISO-8859-15 would work best (it includes the € symbol plus a few other European characters not included in UTF-8) and so it does – no problems with html_entity_decode and no missing characters.

    Thanks again Tommy for a nice article,
    Dan

  17. #17
    SitePoint Author silver trophybronze trophy

    Join Date
    Nov 2004
    Location
    Ankh-Morpork
    Posts
    12,158
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Using htmlentities() may be necessary if you use a limited encoding, such as the 256-character ISO 8859 series. With UTF-8 you never have to use such awkward functions. htmlentities() bloats the file size by converting everything outside the US-ASCII range into NCRs.

    UTF-8 most certainly handles the Euro character, U+20AC. It will be encoded with three octets: E2 82 AC (226, 130, 172). This character is not available in ISO 8859-1, but it is in ISO 8859-15 where it has code point 0xA4.
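
    These byte values are easy to verify. A small Python check (variable names are illustrative only):

```python
euro = "\u20ac"  # the Euro sign

utf8 = euro.encode("utf-8")
print(" ".join(f"{b:02X}" for b in utf8), list(utf8))  # E2 82 AC [226, 130, 172]

latin9 = euro.encode("iso-8859-15")
print(f"0x{latin9[0]:02X}")  # 0xA4

# ISO 8859-1 has no slot for the character, so encoding fails.
try:
    euro.encode("iso-8859-1")
    in_latin1 = True
except UnicodeEncodeError:
    in_latin1 = False
print(in_latin1)  # False
```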

    PHP doesn't have native UTF-8 support (yet), but you can still use it if you are aware of the caveats. For instance, strlen() may report the wrong length since it assumes that every character is one octet.
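
    The strlen() caveat comes down to counting octets instead of characters. The difference can be shown in Python, where `len()` on a string counts characters and `len()` on the encoded bytes counts octets (the sample word is an arbitrary choice):

```python
word = "Ærøskøbing"

characters = len(word)             # what a reader would count
octets = len(word.encode("utf-8"))  # what a byte-oriented strlen() reports

print(characters, octets)  # 10 13
```

    Each of the three non-ASCII letters takes two octets in UTF-8, which accounts for the difference of three.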
    Birnam wood is come to Dunsinane

  18. #18
    bronze trophy
    Join Date
    Dec 2004
    Location
    Sweden
    Posts
    2,670
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    htmlentities() doesn't work well with UTF-8, or so I heard. If you need escaping then use htmlspecialchars() instead.
    Simon Pieters

  19. #19
    SitePoint Author silver trophybronze trophy

    Join Date
    Nov 2004
    Location
    Ankh-Morpork
    Posts
    12,158
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    If you use UTF-8, there is no reason whatsoever to use htmlentities(). Any valid ISO 10646 character can be natively represented in UTF-8.
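
    With a UTF-8 page, only the characters that are significant to the markup need escaping. A Python sketch of that idea, using the standard library's `html.escape` as a stand-in for htmlspecialchars() (the sample comment is hypothetical):

```python
import html

comment = 'Łódź & "Zürich" <b>€</b>'
escaped = html.escape(comment)  # escapes & < > " ' and nothing else
print(escaped)  # Łódź &amp; &quot;Zürich&quot; &lt;b&gt;€&lt;/b&gt;

# The accented letters and the Euro sign pass through unchanged and are
# simply written out as UTF-8 octets.
payload = escaped.encode("utf-8")
```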
    Birnam wood is come to Dunsinane

  20. #20
    SitePoint Zealot Michel Merlin's Avatar
    Join Date
    Mar 2005
    Location
    Versailles (France)
    Posts
    169
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    I went back from Latin 9 ISO (8859-15) to Western European ISO (8859-1)

    Quote Originally Posted by danNL View Post
    With the help of this article I realized that ISO-8859-15 would work best (it includes the € symbol plus a few other European characters not included in UTF-8) and so it does – no problems with html_entity_decode and no missing characters.
    I followed the same path as you, up to Latin 9 ISO (8859-15). However, I have gone back for now to Western European ISO (8859-1). The difference is small (see its description and presentation), and 8859-15 is spreading fast, but there are still many people who read and reply using 8859-1 (for a start, people handling their mail on Yahoo Mail), and then the Euro Symbol (€) will sometimes be kept as such, sometimes changed into the Currency Symbol (¤), sometimes into a square (‡) or a question mark (?). Since the remaining differences are quite small and I prefer 8859-1 (I use the fractions, not the Czech chars, and while I do use them I can do sans Ÿ, œ and Œ), I will wait a little longer. Meanwhile, 8859-1 protects me from being ambiguous since it forces me (or the message won't be saved or sent) to replace the typographical Euro Symbol "€" with the financial one "EUR", which is immediately and unambiguously recognized by any person or program in the world, from Thailand to USA, in any profession, from financial traders to shoe shiners.
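
    The € to ¤ substitution follows from both encodings using the same octet, 0xA4, for different characters. A minimal Python sketch; the helper `encode_for_latin1` is a hypothetical illustration of the "EUR" workaround, not something any mail client is known to do:

```python
message = "1€ + 1£"

sent = message.encode("iso-8859-15")
read_as_latin1 = sent.decode("iso-8859-1")
print(read_as_latin1)  # 1¤ + 1£


def encode_for_latin1(text):
    """Swap the Euro sign for the currency code EUR, then encode as ISO 8859-1."""
    return text.replace("€", "EUR").encode("iso-8859-1")


print(encode_for_latin1(message))
```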

    Versailles, Thu 18 Jan 2007 22:20:40 +0100

  21. #21
    SitePoint Zealot Michel Merlin's Avatar
    Join Date
    Mar 2005
    Location
    Versailles (France)
    Posts
    169
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    I first went back from UTF-8 to ISO-8859-1

    I forgot to mention that I first went back from UTF-8 to 8859-1 (before 8859-15). As I said above, UTF-8 fails in fact and in the real world to reliably achieve its main goal: rendering of all characters. Whether the cause is UTF-8, OE (Outlook Express), MS, or me, I am not sure. I did a few tests with IE (without OE), or with First Page 2006 (which returned results similar but not identical), but none with FF+TB or with Linux. If anyone has a clue, I would be grateful if you could post it after my tests on Newsgroup: MS Public OE General, starting with Message: OE can't edit HTML source of UTF-8 European messages, Posted: Tue 16 Jan 2007 23:35:20 +0100 (22:35:20 GMT).

    Versailles, Thu 18 Jan 2007 22:47:10 +0100

  22. #22
    Sean
    SitePoint Community Guest
    Forking into the issue of PHP 4's lack of support for Unicode, does this explain why when I upload a text file saved as UTF-8 into a blob field in a MySQL database and serve the contents via PHP, I get that byte-order mark which shouldn't be printed? I've had to use ANSI encoding to avoid this issue thus far.

  23. #23
    SitePoint Author silver trophybronze trophy

    Join Date
    Nov 2004
    Location
    Ankh-Morpork
    Posts
    12,158
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    @Michel: UTF-8 is not a failure. Microsoft Outlook Express is, if it cannot handle such a common encoding. Your argument is like saying that CSS is a failure because IE doesn't support it properly.

    @Sean: Since PHP4 doesn't have native UTF-8 support, I very much doubt that it will output a BOM without your saying so. A visible BOM may indicate that your web server is declaring the encoding as, e.g., ISO 8859-1.
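
    What a "visible BOM" looks like can be reproduced in Python. This sketch assumes the stored file starts with the UTF-8 byte-order mark and is then read under a wrong ISO 8859-1 label; the sample text is arbitrary.

```python
import codecs

# A UTF-8 file saved with a byte-order mark in front of the text.
body = codecs.BOM_UTF8 + "Hello".encode("utf-8")

shown_as_latin1 = body.decode("iso-8859-1")
print(shown_as_latin1)  # ï»¿Hello

# Decoding with "utf-8-sig" drops the mark if it is present.
stripped = body.decode("utf-8-sig")
print(stripped)  # Hello
```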
    Birnam wood is come to Dunsinane

  24. #24
    SitePoint Zealot Michel Merlin's Avatar
    Join Date
    Mar 2005
    Location
    Versailles (France)
    Posts
    169
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    OK, OE bad. Then, which one is good (for UTF-8)?

    Quote Originally Posted by AutisticCuckoo View Post
    @Michel: UTF-8 is not a failure. Microsoft Outlook Express is, if it cannot handle such a common encoding.
    Of course OE is a failure in this issue, but how can you be sure UTF-8 is not? Please show me a test of a mail handler other than OE (e.g. TB) that would properly render and edit the test I linked.

    For instance UTF-8 also fails in First Page 2006:
    Code:
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01
    Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
    <html><head><title>Latin 9 (iso-8859-15) « À CURAÇAO, Éric
    n'a donné à Françoise Spaßmann que 1?+1£+$1±5% »</title>
    <META http-equiv=Content-Type content="text/html; charset=iso-8859-15">
    <STYLE>BODY {BACKGROUND: white; FONT: 10pt arial;COLOR: black} </STYLE>
    </head>
    <body>
    <DIV>Latin 9 (iso-8859-15) «&nbsp;À CURAÇAO, Éric n'a donné
    à Françoise Spaßmann que 1€+1£+$1±5%&nbsp;»</DIV>
    </body></html>
    displays (in Preview pane):
    Latin 9 (iso-8859-15) « À CURAÇAO, Éric n'a donné à Françoise Spaßmann que 1€+1£+$1±5% »
    but replacing "Latin 9 (iso-8859-15)" with "Unicode (UTF-8)":
    Code:
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01
    Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
    <html><head><title>Unicode (UTF-8) « À CURAÇAO, Éric n'a
    donné à Françoise Spaßmann que 1?+1£+$1±5% »</title>
    <META http-equiv=Content-Type content="text/html; charset=utf-8">
    <STYLE>BODY {BACKGROUND: white;FONT: 10pt arial;COLOR: black}</STYLE>
    </head>
    <body>
    <DIV>Unicode (UTF-8) «&nbsp;À CURAÇAO, Éric n'a donné
    à Françoise Spaßmann que 1€+1£+$1±5%&nbsp;»</DIV>
    </body></html>
    now displays:
    Unicode (UTF-8) � � CURA�AO, �ric n'a donn� � Fran�oise Spa�mann que 1�+1�+$1�5% �
    and this is for display only; still remains to edit...
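
    One plausible explanation for that output, which the thread does not confirm, is that the file was still saved in a single-byte encoding while the META element declared utf-8. Under that assumption, a Python sketch reproduces the same pattern of replacement characters:

```python
phrase = "À CURAÇAO, Éric"

# Bytes as a single-byte editor would save them (assumption: ISO 8859-15).
single_byte = phrase.encode("iso-8859-15")

# A client trusting the utf-8 label finds invalid sequences and
# substitutes U+FFFD for each stray octet.
displayed = single_byte.decode("utf-8", errors="replace")
print(displayed)

# Saving the same text as real UTF-8 round-trips without loss.
print(phrase.encode("utf-8").decode("utf-8") == phrase)  # True
```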

    So, please show me (with tests) which mail/news handler I could choose to get UTF-8 properly handled (I am quite open, and sure some exist). TIA,

    Versailles, Fri 19 Jan 2007 10:35:20 +0100

  25. #25
    SitePoint Author silver trophybronze trophy

    Join Date
    Nov 2004
    Location
    Ankh-Morpork
    Posts
    12,158
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by Michel Merlin View Post
    Of course OE is a failure in this issue, but how are you sure UTF-8 is not?
    Because UTF-8 has very simple rules for how to encode any character in the Unicode/ISO 10646 repertoire. It cannot fail.
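
    Those rules fit in a short function. The following Python sketch encodes a single code point by hand and compares the result with the built-in encoder; the function name is hypothetical.

```python
def utf8_encode_code_point(cp):
    """Encode one Unicode code point as UTF-8 octets."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if cp < 0x80:        # 1 octet:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 2 octets: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 3 octets: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 octets: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


for ch in "Aé€中":
    print(ch, utf8_encode_code_point(ord(ch)) == ch.encode("utf-8"))
```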

    Quote Originally Posted by Michel Merlin View Post
    Please show me a test of another mail handler than OE (e.g. TB) that would render and edit properly the test I linked.
    The article mainly concerns character encoding for websites. I don't think it is fair to suggest that everyone use limited encodings and bloat their pages with NCRs just because some email clients are buggy. ISO 8859-15 may work well for you, but what about authors who want to publish in Chinese?

    I use Opera's built-in email client and it has no problems with UTF-8.

    Copying text from one application and pasting it into another can cause all sorts of issues, depending on the operating system and any intermediate clipboard applications. That is not because of any problems with any one character encoding, but because software vendors cannot agree to use a single encoding (or even repertoire) that works for all needs and languages.

    The problems you are describing are like comparing languages. If I copy a passage in French from the web and email it to my brother, he won't understand it. That doesn't mean there's anything wrong with French (or my brother) or that all web pages should be written in Swedish.

    BTW, that editing software you linked to doesn't seem to be worth its price. Software that claims to support XHTML but doesn't handle UTF-8 is laughable.
    Birnam wood is come to Dunsinane

