Continuing with the theory that everyone lives in Zurich, doing a talk tonight: “Scripters UTF-8 Survival Guide”. More detail here - feel free to drop by (will post the slides tomorrow)
So this isn’t a complete spam post, a question…
If you’re using Unicode / UTF-8, do you still need HTML entities?
My view is here…
With modern web browsers and widespread support for UTF-8, you don’t need htmlentities because all of these characters can be represented directly in UTF-8. More importantly, in general, only browsers support HTML’s special characters - a normal text editor, for example, is unaware of HTML entities. Depending on what you’re doing, using htmlentities may reduce the ability of other systems to “consume” your content.
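To make that concrete, here is a small sketch. It is written in Python purely for illustration (the talk is about PHP), and it assumes the standard library's `html.unescape` is a fair stand-in for what a browser does with entities. It shows that the entity form and the literal form are the same text once an HTML-aware consumer has decoded them, while anything else just sees ampersands.

```python
import html

entity_text = "Z&uuml;rich caf&eacute;"   # what htmlentities-style output looks like
literal_text = "Zürich café"              # the same text as real Unicode characters

# An HTML-aware consumer (a browser, or html.unescape here) turns the
# entities back into real characters.
decoded = html.unescape(entity_text)

# The literal form needs no decoding step; it is simply stored as UTF-8 bytes.
utf8_bytes = literal_text.encode("utf-8")

# A consumer that knows nothing about HTML (text editor, CSV import, ...)
# gets the entity text exactly as written.
plain_consumer_view = entity_text

print(decoded == literal_text)
print(utf8_bytes)
print(plain_consumer_view)
```

The last line is the point about other systems: the entity version only makes sense to something that speaks HTML.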
Who wants to shoot that down?





August 8th, 2006 at 9:16 pm
Of course, don’t forget the beautifully titled The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
August 8th, 2006 at 9:34 pm
Gets an honorary mention, as do http://www.cs.tut.fi/~jkorpela/chars.html (the best read I’ve found) and http://www.sitepoint.com/blogs/2006/03/15/do-you-know-your-character-encodings/
August 8th, 2006 at 10:28 pm
I agree with you - we completely switched to UTF-8 years ago (no HTML entities, the real UTF-8 characters) and haven’t had any serious problem. (Well, we cannot edit umlauts in vi anymore…)
You might need to use numeric Unicode character entities instead of the actual UTF-8 characters in HTML e-mails, though: some web-based e-mail clients will display the e-mail embedded within an ISO-8859-1 web page (ignoring your e-mail’s character set). In this environment, Unicode characters will look broken, but numeric character entities seem to work fine.
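For anyone wanting to try the numeric-entity approach, a minimal sketch (in Python for illustration; the helper name `to_numeric_entities` is made up for this example). It relies on the standard `xmlcharrefreplace` error handler, which swaps every character that does not fit into ASCII for a decimal numeric character reference and leaves everything else, including the markup, untouched.

```python
def to_numeric_entities(text: str) -> str:
    """Replace every non-ASCII character with a decimal numeric character reference."""
    # Encoding to ASCII would normally fail on "ü"; the xmlcharrefreplace
    # handler emits "&#252;" instead. The result is pure ASCII, so it
    # survives being embedded in a page with any ASCII-compatible charset.
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


html_body = to_numeric_entities("<p>Grüße aus Zürich, 10 €</p>")
print(html_body)  # <p>Gr&#252;&#223;e aus Z&#252;rich, 10 &#8364;</p>
```

Note that this does not escape `&`, `<` or `>`; it only deals with the non-ASCII characters.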
August 8th, 2006 at 10:37 pm
What about “less than” and “greater than” chars? So I think you generally still need to do some kind of encoding.
August 8th, 2006 at 10:48 pm
You need to _escape_ three characters: ampersand, less-than and greater-than. That has nothing to do with encoding, Unicode or character sets.
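A quick sketch of that distinction, in Python for illustration (the wrapper name `escape_markup` is hypothetical; `html.escape` is the standard library function). With `quote=False` only the three markup characters are touched, and the non-ASCII characters pass through as they are.

```python
import html


def escape_markup(text: str) -> str:
    # quote=False: only "&", "<" and ">" are replaced.
    # Umlauts and other non-ASCII characters are left alone.
    return html.escape(text, quote=False)


sample = 'Tom & Jerry <b>größer</b> als "früher"'
escaped = escape_markup(sample)
print(escaped)  # Tom &amp; Jerry &lt;b&gt;größer&lt;/b&gt; als "früher"
```

If the text ends up inside an attribute value, the quote characters need escaping as well, which is what the default `quote=True` does.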
I agree with Harry. The only problem I’ve encountered is half-knowledge by a few users of the software who were conditioned to think that if non-ASCII is not represented with htmlentities then it must be broken (this double negative actually made sense).
The “vi” problem is easily fixed by using an up-to-date distribution with unicode support. ;)
August 8th, 2006 at 11:18 pm
Have yet to entirely figure out what the perfect world solution is for when you’ve got UTF-8 and want to use it in an email. Aside from HTML email, when you want to place UTF-8 in the subject / body of a text email, do you use base64 or quoted-printable, combined with the right MIME headers? Or perhaps convert to UTF-7 (for mail servers that support only 7-bit encoding)? Will be cunningly skipping over that tonight ;)
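Not an answer to which option is best, but here is a sketch of one of them: a UTF-8 body sent as quoted-printable, with the subject left to the library to encode. It uses Python's standard `email` package for illustration; the function name `build_utf8_mail` and the example.com addresses are placeholders, not anything from the talk.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def build_utf8_mail(subject: str, body: str) -> bytes:
    msg = EmailMessage()                      # uses email.policy.default
    msg["From"] = "sender@example.com"        # placeholder address
    msg["To"] = "recipient@example.com"       # placeholder address
    # Non-ASCII header text is turned into RFC 2047 encoded words
    # when the message is serialised.
    msg["Subject"] = subject
    # Declare UTF-8 and pick quoted-printable as the transfer encoding,
    # so the bytes on the wire stay 7-bit clean.
    msg.set_content(body, charset="utf-8", cte="quoted-printable")
    return msg.as_bytes()


raw = build_utf8_mail("Grüße", "Schöne Grüße aus Zürich, 10 € bitte.\n")

# Parse it back to check that nothing was lost on the way.
parsed = message_from_bytes(raw, policy=policy.default)
print(raw.decode("ascii"))
print(parsed["Subject"], "/", parsed.get_content().strip())
```

Passing `cte="base64"` instead gives the base64 variant; quoted-printable tends to stay more readable when most of the text is ASCII.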
You’re right - should have been more explicit with wording - for the “special five” that are part of XML / HTML markup, you still need htmlspecialchars() - mentioned here.
Actually that pops up an interesting side note - was browsing the PHP source that implements htmlspecialchars() and htmlentities(), trying to figure out whether htmlspecialchars() would really be OK with UTF-8, without explicitly declaring it.
In short, both functions are wrappers around the same underlying code and there’s a ton of stuff happening here (hash table lookups, locale checks etc. etc.).
Given that htmlspecialchars() is a function that tends to get used a lot and that it’s offering pretty simple functionality, I wonder what the performance overhead is here, and whether it could be improved on by a userland PHP function?
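On the question of whether escaping is safe on UTF-8 without declaring the charset, a stripped-down sketch may help. This is Python for illustration and is not how PHP implements htmlspecialchars(); the name `escape_utf8_bytes` is hypothetical. It works on raw bytes and never decodes them, which is fine for UTF-8 because every byte of a multi-byte sequence has its high bit set and so can never be mistaken for one of the five ASCII markup characters. It says nothing about performance in PHP.

```python
import html

# The five characters that matter in HTML / XML markup, as single ASCII bytes.
# "&" has to be handled first, otherwise the ampersands introduced by the
# later replacements would be escaped a second time.
_REPLACEMENTS = (
    (b"&", b"&amp;"),
    (b"<", b"&lt;"),
    (b">", b"&gt;"),
    (b'"', b"&quot;"),
    (b"'", b"&#x27;"),
)


def escape_utf8_bytes(data: bytes) -> bytes:
    """Escape markup characters in UTF-8 bytes without decoding them."""
    for needle, replacement in _REPLACEMENTS:
        data = data.replace(needle, replacement)
    return data


sample = 'Zürich <a href="x">Café & Bar</a> 日本語 \'ok\''
byte_level = escape_utf8_bytes(sample.encode("utf-8")).decode("utf-8")

# Same result as the charset-aware route through decoded text.
print(byte_level == html.escape(sample, quote=True))
```

The same shortcut would not be safe for every multi-byte encoding, since some of them reuse ASCII byte values inside multi-byte characters.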
August 8th, 2006 at 11:41 pm
Initial experiments suggest not. It is tempting to consider an alternative, stripped down C implementation though.
August 9th, 2006 at 12:03 am
Core Web Application Development with PHP and MySQL (http://www.amazon.com/gp/product/0131867164/104-6815758-1867124?v=glance&n=283155) by Marc Wandschneider has a great chapter called “Strings and Characters of the World”. The whole book is great as it focuses on development using UTF-8.
August 9th, 2006 at 8:39 am
Great talk. Thanks, Harry.
August 9th, 2006 at 10:49 pm
Just reading the presentation, it seems to be missing information on _charset_, which most of the recent browsers now support (IE, Opera & Firefox). It is a hidden form field that the browser fills in itself, explicitly telling the server which charset was used to encode the form payload.
https://bugzilla.mozilla.org/show_bug.cgi?id=18643
http://whatwg.org/specs/web-forms/current-work/#the-charset
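A rough sketch of how the receiving side could make use of that field, in Python for illustration. The function name `decode_form` and the sample payloads are made up; the only assumption about the browser is that it sends the encoding name in a field called `_charset_`. The payload is parsed twice: once just to find the charset, then again with that charset to decode the real values.

```python
import codecs
from urllib.parse import parse_qs


def decode_form(payload: str, default: str = "utf-8") -> dict:
    """Decode a urlencoded form body using the charset named in _charset_."""
    # First pass: charset names are plain ASCII, so any ASCII-compatible
    # decoding is good enough to read the _charset_ value.
    probe = parse_qs(payload, encoding="latin-1")
    charset = probe.get("_charset_", [default])[0]

    # Never trust the value blindly: fall back if Python does not know it.
    try:
        codecs.lookup(charset)
    except LookupError:
        charset = default

    # Second pass: decode the percent-escaped bytes with the right charset.
    fields = parse_qs(payload, encoding=charset, errors="replace")
    return {name: values[0] for name, values in fields.items()}


latin1_form = decode_form("city=Z%FCrich&_charset_=ISO-8859-1")
utf8_form = decode_form("city=Z%C3%BCrich&_charset_=UTF-8")
print(latin1_form["city"], utf8_form["city"])  # Zürich Zürich
```

When the field is missing, the sketch simply assumes UTF-8, which is a guess rather than something the browser has confirmed.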
August 9th, 2006 at 11:08 pm
That’s a good point (in fact I did mention it in the talk but it’s not in the slides) - to date that’s something I haven’t played with first hand, just read about. What also interests me is the full story on the conditions under which browsers would ignore a form’s accept-charset="utf-8" attribute (if any).
August 15th, 2006 at 12:06 am
[…] One of the subjects I brushed over last week was how you handle UTF-8 in email, because I don’t have a full picture on the best way to solve this. The fundamental problem is summarized nicely on Wikipedia’s discussion of MIME; The basic Internet e-mail transmission protocol, SMTP, supports only 7-bit ASCII characters […]. This effectively limits Internet e-mail to messages which, when transmitted, include only the characters sufficient for writing a small number of languages, primarily English. Other languages based on the Latin alphabet typically include diacritics not supported in 7-bit ASCII, meaning text in these languages cannot be correctly represented in basic e-mail. […]