UTF-8 Survival at webtuesday.ch

Continuing with the theory that everyone lives in Zurich, doing a talk tonight: “Scripters UTF-8 Survival Guide”. More detail here – feel free to drop by (will post the slides tomorrow)

So this isn’t a complete spam post, a question…

If you’re using Unicode / UTF-8, do you still need HTML entities?

My view is here

With modern web browsers and widespread support for UTF-8, you don’t need htmlentities because all of these characters can be represented directly in UTF-8. More importantly, in general, only browsers support HTML’s special characters – a normal text editor, for example, is unaware of HTML entities. Depending on what you’re doing, using htmlentities may reduce the ability of other systems to “consume” your content.
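
To make the difference concrete, here is a minimal PHP sketch. The sample string is made up, and the sketch assumes the script file itself is saved as UTF-8 and the page is served with a UTF-8 charset. It compares htmlentities(), which rewrites every character that has a named entity, with htmlspecialchars(), which only touches the characters that are significant in markup.

```php
<?php
// Hypothetical sample text; assumes this file is saved as UTF-8.
$text = 'Zürich & Genève <café>';

// htmlentities() replaces every character that has a named HTML entity.
$withEntities = htmlentities($text, ENT_QUOTES, 'UTF-8');
// Z&uuml;rich &amp; Gen&egrave;ve &lt;caf&eacute;&gt;

// htmlspecialchars() only escapes &, <, >, " and ' and leaves the rest as UTF-8.
$escapedOnly = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
// Zürich &amp; Genève &lt;café&gt;

echo $withEntities, "\n", $escapedOnly, "\n";
```

The second form stays readable in a plain text editor or any other consumer that knows nothing about HTML entities.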

Who wants to shoot that down?

  • http://www.lopsica.com BerislavLopac
  • http://www.phppatterns.com HarryF

    don’t forget

    Gets an honorary mention, as do http://www.cs.tut.fi/~jkorpela/chars.html (the best read I’ve found) and http://www.sitepoint.com/blogs/2006/03/15/do-you-know-your-character-encodings/

  • Tim Strehle

    I agree with you – we completely switched to UTF-8 years ago (no HTML entities, the real UTF-8 characters) and haven’t had any serious problem. (Well, we cannot edit umlauts in vi anymore…)

    You might need to use numeric Unicode character entities instead of the actual UTF-8 characters in HTML e-mails, though: Some web-based e-mail clients will display the e-mail embedded within an ISO-8859-1 web page (ignoring your e-mail’s character set). In this environment, Unicode characters will look broken, but numeric character entities seem to work fine.
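
    A minimal PHP sketch of that workaround, assuming the mbstring extension is available. The function name to_numeric_entities() and the sample string are made up for illustration. It turns every non-ASCII character into a decimal numeric character reference and leaves ASCII, including any markup, untouched.

    ```php
    <?php
    // Hypothetical helper: replace every non-ASCII character in a UTF-8 string
    // with a decimal numeric character reference such as &#252;.
    function to_numeric_entities(string $utf8): string
    {
        // Conversion map: [first code point, last code point, offset, mask].
        $convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
        return mb_encode_numericentity($utf8, $convmap, 'UTF-8');
    }

    echo to_numeric_entities('<p>Grüße aus Zürich, 5 €</p>'), "\n";
    // <p>Gr&#252;&#223;e aus Z&#252;rich, 5 &#8364;</p>
    ```

    Because the result is pure ASCII, it should display the same way whichever charset the surrounding web page declares.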

  • bungle

    What about “less than” and “more than” chars? So I think you generally still need to do some kind of encoding.

  • R. U. Serious

    You need to _escape_ three characters: ampersand, less-than and greater-than. That has nothing to do with encoding, Unicode or character sets.

    I agree with Harry. The only problem I’ve encountered is half-knowledge by a few users of the software who were conditioned to think that if non-ASCII is not represented with htmlentities then it must be broken (this double negative actually made sense).

    The “vi” problem is easily fixed by using an up-to-date distribution with unicode support. ;)

  • http://www.phppatterns.com HarryF

    You might need to use numeric Unicode character entities instead of the actual UTF-8 characters in HTML e-mails, though: Some web-based e-mail clients will display the e-mail embedded within an ISO-8859-1 web page (ignoring your e-mail’s character set). In this environment, Unicode characters will look broken, but numeric character entities seem to work fine.

    Have yet to entirely figure out what the perfect world solution is for when you’ve got UTF-8 and want to use it in an email. Aside from HTML email, when you want to place UTF-8 in the subject / body of a text email, do you use base64 or quoted-printable, combined with the right MIME headers? Or perhaps convert to UTF-7 (for mail servers that support only 7-bit encoding)? Will be cunningly skipping over that tonight ;)
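
    One common approach, sketched below in PHP under stated assumptions rather than as the perfect world answer: put a short UTF-8 subject into an RFC 2047 base64 "encoded-word" and send the body as quoted-printable with matching MIME headers. The helper name encode_subject(), the sample strings and the address are hypothetical, and the helper does not split long subjects (an encoded-word may be at most 75 characters long).

    ```php
    <?php
    // Hypothetical helper: wrap a short UTF-8 subject in one RFC 2047 encoded-word.
    // Assumption: the result stays within the 75-character encoded-word limit;
    // longer subjects would need to be split into several encoded-words.
    function encode_subject(string $utf8Subject): string
    {
        return '=?UTF-8?B?' . base64_encode($utf8Subject) . '?=';
    }

    $subject = encode_subject('Grüße aus Zürich');

    // Quoted-printable keeps ASCII readable and writes other bytes as =XX.
    $body = quoted_printable_encode("Grüße aus Zürich\r\n");

    // Headers that tell the mail client how to decode the body.
    $headers = [
        'MIME-Version: 1.0',
        'Content-Type: text/plain; charset=UTF-8',
        'Content-Transfer-Encoding: quoted-printable',
    ];

    // Sending is left commented out; the address is a placeholder.
    // mail('someone@example.com', $subject, $body, implode("\r\n", $headers));
    ```

    Both encodings produce 7-bit-safe output, which sidesteps the UTF-7 question for servers that only handle 7-bit data.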

    What about “less than” and “more than” chars? So I think you generally still need to do some kind of encoding.

    You’re right – should have been more explicit with wording – for the “special five” that are part of XML / HTML markup, you still need htmlspecialchars() – mentioned here.

    Actually that pops up an interesting side note – was browsing the PHP source that implements htmlspecialchars() and htmlentities(), trying to figure out whether htmlspecialchars() would really be OK with UTF-8, without explicitly declaring it.

    In short, both functions are wrappers around the same underlying code and there’s a ton of stuff happening here (hash table lookups, locale checks etc. etc.).

    Given that htmlspecialchars() is a function that tends to get used a lot and that it’s offering pretty simple functionality, I wonder what the performance overhead is here, and whether it could be improved on by a userland PHP function?
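
    For reference, a userland version of the “special five” escaping could look like the PHP sketch below. The function name escape_markup() is made up, and this is not how the PHP source implements it. It relies on the fact that in UTF-8 every byte of a multi-byte character has its high bit set, so a byte-wise replacement of the five ASCII characters cannot corrupt other characters. Unlike htmlspecialchars(), it does not validate the input encoding.

    ```php
    <?php
    // Hypothetical userland stand-in for htmlspecialchars($s, ENT_QUOTES, 'UTF-8').
    // strtr() with an array makes a single pass over the input, so the "&" in a
    // freshly inserted "&lt;" is never escaped a second time.
    function escape_markup(string $s): string
    {
        return strtr($s, [
            '&' => '&amp;',
            '<' => '&lt;',
            '>' => '&gt;',
            '"' => '&quot;',
            "'" => '&#039;',
        ]);
    }

    // Made-up sample; assumes this file is saved as UTF-8.
    $sample = '<a href="?x=1&y=2">Zürich\'s café</a>';

    // Compare against the built-in function for this valid UTF-8 input.
    $same = escape_markup($sample) === htmlspecialchars($sample, ENT_QUOTES, 'UTF-8');
    var_dump($same);
    ```

    Whether this is any faster is a separate question; timing both calls in a loop with microtime(true) on the target PHP version would be one way to find out.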

  • http://www.phppatterns.com HarryF

    I wonder what the performance overhead is here, and whether it could be improved on by a userland PHP function?

    Initial experiments suggest not. It is tempting to consider an alternative, stripped down C implementation though.

  • Jason Batten

    Core Web Application Development with PHP and MySQL (http://www.amazon.com/gp/product/0131867164/104-6815758-1867124?v=glance&n=283155) by Marc Wandschneider has a great chapter called “Strings and Characters of the World”. The whole book is great, as it focuses on development using UTF-8.

  • http://www.tilllate.com silvanm

    Great talk. Thanks, Harry.

  • Pingback: SitePoint Blogs » Scripters UTF-8 Survival Guide (slides)

  • Ren

    Just reading the presentation, it seems to be missing information on _charset_, which most recent browsers now support (IE, Opera and Firefox). It is a hidden form field that the browser fills in with the name of the charset it used to encode the form payload.

    https://bugzilla.mozilla.org/show_bug.cgi?id=18643
    http://whatwg.org/specs/web-forms/current-work/#the-charset

  • http://www.phppatterns.com HarryF

    Just reading the presentation, it seems to be missing information on _charset_, which most recent browsers now support (IE, Opera and Firefox). It is a hidden form field that the browser fills in with the name of the charset it used to encode the form payload.

    That’s a good point (in fact I did mention it in the talk but it’s not in the slides) – to date that’s something I haven’t played with first hand, just read about. What also interests me is the full story on the conditions under which browsers would ignore the form’s accept-charset="utf-8" attribute (if any).
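
    A minimal PHP sketch of how a script might use that field, assuming the form contains a hidden input named _charset_ and that the mbstring extension is available. The function name form_value_to_utf8(), the short allow-list and the field name "comment" are hypothetical, and treating a missing or unlisted charset as UTF-8 is a simplifying assumption.

    ```php
    <?php
    // Hypothetical helper: convert a submitted form value to UTF-8 based on the
    // charset name the browser placed in the hidden "_charset_" field.
    function form_value_to_utf8(string $value, ?string $declared): string
    {
        // Assumption: only these source charsets are worth converting from.
        $allowed = ['iso-8859-1', 'iso-8859-15', 'windows-1252'];

        if ($declared === null || !in_array(strtolower($declared), $allowed, true)) {
            return $value; // Fall back to treating the value as UTF-8 already.
        }
        return mb_convert_encoding($value, 'UTF-8', $declared);
    }

    // Typical use inside a form handler (commented out, as there is no request here):
    // $comment = form_value_to_utf8($_POST['comment'] ?? '', $_POST['_charset_'] ?? null);

    // "\xFC" is the single byte for "ü" in ISO-8859-1.
    echo form_value_to_utf8("Z\xFCrich", 'ISO-8859-1'), "\n"; // Zürich
    ```

    The allow-list keeps a user-supplied charset name from being passed straight to the conversion function.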

  • Pingback: SitePoint Blogs » UTF-8 Email in PHP with eZ Components

  • Pingback: Nerd Fish » Blog Archive » UTF-8 Email in PHP with eZ Components