By Simon Willison

Understanding Unicode

One of the tougher issues on the modern web is that of internationalization, often shortened to i18n (‘i’, then 18 more letters, then ‘n’). The world-wide web truly lives up to its name, and even if your site has a local audience you may still find yourself dealing with foreign letters, be they names with German umlauts or quotations in a foreign language.

The ultimate solution to the foreign character problem is Unicode, a truly enormous standard which attempts to document and provide encoding for virtually every character in every language known to man, with space left over for future language developments. A great starting point for understanding Unicode is Tim Bray’s essay, On Unicode. If that leaves you thirsty for more, Joel Spolsky’s The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) makes for great reading as well.
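To make the idea concrete, here is a minimal sketch in Python (chosen purely for illustration; the article does not prescribe a language). It shows that every character has a Unicode code point, and that an encoding such as UTF-8 turns that code point into one or more bytes. The helper name `describe` is hypothetical.

```python
def describe(ch):
    """Return the code point label (e.g. 'U+00FC') and the UTF-8 bytes for one character."""
    return f"U+{ord(ch):04X}", ch.encode("utf-8")


# An ASCII letter, a German umlaut and a Chinese character.
for ch in "aü中":
    label, raw = describe(ch)
    # bytes.hex(" ") prints the bytes as space-separated hex pairs (Python 3.8+).
    print(ch, label, raw.hex(" "))
```

The ASCII letter needs a single byte in UTF-8, the umlaut needs two, and the Chinese character needs three, which is why the number of characters and the number of bytes in a string are not interchangeable.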

Now that you understand the basis of the i18n problem, how do you go about solving it in your own work with the web? Thankfully, the W3C have two useful documents on the subject as part of their Authoring Techniques for XHTML and HTML Internationalization series: Characters and Encodings 1.0 and Specifying the language of content. Don’t be put off by the long-winded titles or the extensive preambles; the meat of these documents is a set of very easy-to-follow guidelines. Jukka Korpela’s tutorial on character code issues is another excellent resource on the subject.

As a final note, if you’re looking to use Unicode with PHP, you may find Keith Devens’ notes on the subject extremely useful. And if you want to test your web applications for character encoding awareness, try pasting in the test from Sam Ruby’s Survival guide to i18n.
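One way to picture what such a test checks: text that leaves your application as UTF-8 should come back unchanged, while decoding it with the wrong character set produces recognisable garbage. The Python sketch below is only an illustration under those assumptions, not Sam Ruby’s actual test; the sample string and the function name `survives_round_trip` are made up for the example.

```python
# Hypothetical sample text mixing accented Latin letters and Chinese characters.
SAMPLE = "Zürich, naïve, 中文"


def survives_round_trip(text, encoding="utf-8"):
    """Return True if text can be encoded and decoded with `encoding` unchanged."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeError:
        # Raised when the encoding cannot represent one of the characters.
        return False


print(survives_round_trip(SAMPLE))             # UTF-8 can represent every character
print(survives_round_trip(SAMPLE, "latin-1"))  # Latin-1 has no Chinese characters

# Decoding UTF-8 bytes as Latin-1 is the classic source of garbled output:
mangled = "ü".encode("utf-8").decode("latin-1")
print(mangled)  # Ã¼
```

If pasted text comes back from your application looking like the `mangled` value, some layer between the form and the page is probably assuming the wrong encoding.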

  • Sathyaish Chakravarthy

    To this list of useful links on the Unicode character set, I’d like to add a chapter from Steven Roman’s book Win32 API Programming with Visual Basic. It is the best ever resource I have discovered on the subject. The chapter is titled “Strings” and sits in the MSDN April 2001 library as well.

  • As a web developer in South Wales I am often tasked with producing bilingual websites. Internationalisation is extremely important; thanks, Simon, for highlighting it.

  • This is a terribly important topic for global developers – and one not covered enough! Thanks a bunch for the great links and bringing this to our attention.

  • avine

    A couple more sites that are useful for Unicode and i18n:

    The Unicode Consortium

    Sun Globalization Resources

    I18n Guy’s I18n and L10n portal

  • Mattias

    Great that this topic is covered. One thing that I find missing is reference data, i.e. files with complete alphabets and so on. In Unicode, of course.
    Unicode string generating software is not abundant either, commercial or not.
    However, this might just be me being blind…

  • pfitz

    Thanks for the useful resources. I have to create an English/Chinese site shortly and they will come in handy. At this stage I am still trying to work out which Chinese set to use :S And a Unicode generator of some sort for Chinese symbols would be nice.. I’ll keep looking.
