Understanding Unicode


One of the tougher issues on the modern web is that of internationalization, often shortened to i18n (‘i’, then 18 more letters, then ‘n’). The World Wide Web truly lives up to its name, and even if your site has a local audience you may still find yourself dealing with foreign characters, be they names with German umlauts or quotations in a foreign language.

The ultimate solution to the foreign character problem is Unicode, a truly enormous standard which attempts to document and provide an encoding for virtually every character in every language known to man, with space left over for future language developments. A great starting point for understanding Unicode is Tim Bray’s essay, On Unicode. If that leaves you thirsty for more, Joel Spolsky’s The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) makes for great reading as well.
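
To make the idea of "characters plus an encoding" concrete, here is a minimal Python sketch. It is not taken from either essay; the helper name `describe` and the sample word are assumptions chosen for illustration. It lists, for each character, its Unicode code point, its official name, and the bytes it becomes when encoded as UTF-8.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name, UTF-8 bytes as hex) for each character."""
    rows = []
    for ch in text:
        code_point = f"U+{ord(ch):04X}"            # the character's number in Unicode
        name = unicodedata.name(ch, "UNKNOWN")      # the standard's name for it
        utf8_hex = ch.encode("utf-8").hex()         # how UTF-8 stores it as bytes
        rows.append((code_point, name, utf8_hex))
    return rows

# Hypothetical sample: a German word with two non-ASCII letters.
for row in describe("Grüße"):
    print(row)
```

Plain ASCII letters come out as a single byte each, while the accented letters take two bytes in UTF-8, which is one reason byte counts and character counts cannot be treated as the same thing.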

Now that you understand the basis of the i18n problem, how do you go about solving it in your own work with the web? Thankfully, the W3C have two useful documents on the subject as part of their Authoring Techniques for XHTML and HTML Internationalization series: Characters and Encodings 1.0 and Specifying the language of content. Don’t be put off by the long-winded titles or the extensive preambles; the meat of these documents is a set of very easy-to-follow guidelines. Jukka Korpela’s tutorial on character code issues is another excellent resource on the subject.

As a final note, if you’re looking to use Unicode with PHP, you may find Keith Devens’ notes on the subject extremely useful. And if you want to test your web applications for character encoding awareness, try pasting in the test from Sam Ruby’s Survival guide to i18n.
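
As a rough illustration of what such a test is looking for, the Python sketch below checks whether a string survives a round trip through an encoding, and simulates the classic failure where UTF-8 bytes are misread as Latin-1. The sample string and the function names `round_trips` and `mojibake` are assumptions made up for this example; they are not the test from Sam Ruby’s guide.

```python
# Hypothetical sample text containing several non-ASCII characters.
SAMPLE = "Grüße, naïve café"

def round_trips(text, encoding):
    """Return True if text is unchanged after encoding and decoding."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeError:
        # The encoding cannot represent one of the characters.
        return False

def mojibake(text):
    """Simulate a common bug: UTF-8 bytes decoded as if they were Latin-1."""
    return text.encode("utf-8").decode("latin-1")

print(round_trips(SAMPLE, "utf-8"))   # UTF-8 can represent every character
print(round_trips(SAMPLE, "ascii"))   # ASCII cannot
print(mojibake(SAMPLE) == SAMPLE)     # the misread text no longer matches
```

If text pasted into your application comes back looking like the output of `mojibake`, with each accented letter replaced by two odd characters, some layer is probably decoding bytes with the wrong character set.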

Simon Willison