Do you know your character encodings?

    Kevin Yank

    This entry reproduced from The Tech Times #134.

    Last month, I attended a meeting of the Melbourne chapter of the Web Standards Group, where Richard Ishida, the Internationalization Activity Lead of the W3C gave a remarkably clear presentation of one of the most ignored issues in web development: character encodings.

    Have you ever noticed certain characters on your site not displaying the way they should? Perhaps the curly quotation marks look like little boxes, or the long dashes have been replaced with question marks. Problems like these usually arise from an incomplete understanding of character encodings on the part of the developer responsible for the site.

    I’d go so far as to guess that, in English-speaking circles at least, most web developers have never learned about character encodings, and just deal with the consequences when issues like the above crop up.

    As a site grows to the point where it must address an international audience (or even just an audience that likes curly quotes), however, it’s more and more difficult to ignore these issues. Even worse, in these heady times of daily hack attempts, incorrect handling of character encodings can result in severe security vulnerabilities (as Google recently discovered).

    So what is a character encoding, exactly? Well, let’s start with something it’s not: a character encoding is not a character set.

    Character Sets

    A character set, or more specifically, a coded character set is a set of character symbols, each of which has a unique numerical ID, which is called the character’s code point.

    Some examples of character sets include the 128-character ASCII character set, which is mostly made up of the letters, numbers, and punctuation used in the English language, and the 256-character ISO-8859-1, or Latin 1 character set, which includes all the ASCII characters plus accented and other additional characters used in related languages like French. The most expansive character set in common use is the Universal Character Set (UCS), as defined in the Unicode standard, which contains over 1.1 million code points.

    The first thing to understand is that every HTML document uses Unicode’s UCS, or more accurately the ISO 10646 character set, which is a less involved standard describing the same set of characters. Some older browsers, or less powerful devices, may not support (and thus will not display) the complete character set, but the fact remains that any HTML document may contain any character in the UCS.

    What does vary from document to document is the character encoding, which defines how each of the characters in the UCS is to be represented as one or more bytes in the text data of the page.

    This figure shows the ASCII, ISO-8859-1, and Unicode code points for three characters (the letter ‘A’, the acute-accented letter ‘e’, and the Hebrew letter ‘alef’), and how those characters map to a series of bytes in five common character encodings:

    [Figure: Some sample character encodings]

    Looking first at the character sets, note how the letter ‘A’ is available as a character in all three character sets, but the acute ‘e’ isn’t available in ASCII, and ‘alef’ is only available in Unicode. The characters keep the same code points across these character sets because ISO-8859-1 was designed as an extension of ASCII, and Unicode in turn was designed as an extension of ISO-8859-1. There are certainly other character sets where the code points of these characters, where they exist, would differ.
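
    To make the idea of a code point concrete, here is a minimal Python sketch. It relies only on the fact stated above that ISO-8859-1 extends ASCII and Unicode extends ISO-8859-1; the helper name smallest_character_set is made up for this illustration.

```python
# The three sample characters from the figure, written as escapes so this
# source file itself stays pure ASCII.
SAMPLES = {"A": "A", "acute e": "\u00e9", "alef": "\u05d0"}


def smallest_character_set(ch: str) -> str:
    """Name the smallest of the three character sets that contains ch."""
    code_point = ord(ch)  # ord() returns the character's Unicode code point
    if code_point < 128:
        return "ASCII"
    if code_point < 256:
        return "ISO-8859-1"
    return "Unicode"


for name, ch in SAMPLES.items():
    print(f"{name}: code point {ord(ch)} (hex {ord(ch):04X}), "
          f"first available in {smallest_character_set(ch)}")
```

    The range checks work only because the three sets share their low code points; for an unrelated character set you would need a real lookup table.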

    As I mentioned above, however, web pages always use the Unicode character set, so these code points are the only ones that matter for the purposes of web development.

    Character Encodings

    Now take a look at the character encodings in the figure. The first, 7-bit ASCII, dates back long before the days of MS-DOS, and is commonly used today as a “lowest common denominator” in email systems. If an email message contains only characters from the ASCII character set, and those characters are encoded as per their ASCII code points (e.g. the letter A is code point 65, which in hexadecimal (base-16) is 41, so the byte value used to represent it should be 41), then it should be compatible with any Internet email system, no matter how obsolete. Because ASCII contains only 128 code points, only seven of the eight bits in a byte are needed to represent any ASCII character. The byte values in a 7-bit ASCII document will therefore never exceed 7F (that’s 127 in base-10).
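
    As a rough illustration of the 7-bit rule, the Python sketch below encodes text as ASCII and checks that no byte exceeds 7F. The function names are invented for this example.

```python
def encode_ascii(text: str) -> bytes:
    """Encode text as 7-bit ASCII; raises UnicodeEncodeError otherwise."""
    return text.encode("ascii")


def is_seven_bit(data: bytes) -> bool:
    """True when no byte value exceeds 7F hex (127 decimal)."""
    return all(byte <= 0x7F for byte in data)


print(encode_ascii("A").hex())               # the single byte for 'A'
print(is_seven_bit(encode_ascii("Hello!")))  # every byte fits in seven bits

try:
    encode_ascii("caf\u00e9")                # acute 'e' is outside ASCII
except UnicodeEncodeError:
    print("not representable in ASCII")
```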

    ISO-8859-1 is the default encoding assumed by many browsers and related English-language software. It uses all eight bits of each byte to represent all 256 code points in the ISO-8859-1 character set. Though this provides the characters required for the vast majority of English language documents, as well as documents in many related languages like French, there are plenty of languages that are based on characters not included in this set. Even certain specialized characters in English documents, curly quotes and long dashes for instance, are not a part of ISO-8859-1. This explains why such characters are most often responsible for revealing a character encoding problem.
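
    A quick way to see which characters fall outside ISO-8859-1 is to try encoding them. This Python sketch (the helper name is hypothetical) shows the acute ‘e’ fitting in a single byte while a curly quote and an em dash do not fit at all.

```python
def fits_iso_8859_1(text: str) -> bool:
    """Return True if every character in text exists in ISO-8859-1."""
    try:
        text.encode("iso-8859-1")
    except UnicodeEncodeError:
        return False
    return True


candidates = {
    "acute e": "\u00e9",
    "left curly double quote": "\u201c",
    "em dash": "\u2014",
}

for name, ch in candidates.items():
    verdict = "fits" if fits_iso_8859_1(ch) else "does not fit"
    print(f"{name}: {verdict}")
```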

    To serve the needs of other languages, there is an abundance of character encodings like ISO-8859-1 that make use of the 256 possible byte values to represent a set of 256 characters. Additionally, there are a number of character encodings that use two bytes per character to allow for 65,536 different characters. Commonly used for Chinese and other languages requiring a large number of characters, these encodings are called double-byte character sets (DBCS), even though they are in fact encodings.

    But for documents that may contain characters from any language, the best encodings are those that can address Unicode’s entire UCS. The simplest of these is UTF-32, which simply uses four bytes to represent each UCS character by its code point. ‘A’, which is code point 65 (41 hex) is represented by the four byte values 00 00 00 41, the acute ‘e’ (code point E9 hex) is 00 00 00 E9, and ‘alef’ (05D0 hex) is 00 00 05 D0.
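
    The byte values quoted above can be reproduced with a short Python sketch. It assumes the big-endian byte order used in this article (Python’s "utf-32-be" codec, which also omits the byte order mark); the helper name is invented.

```python
def utf32_hex(ch: str) -> str:
    """Return the big-endian UTF-32 bytes of ch as spaced hex pairs."""
    data = ch.encode("utf-32-be")
    return " ".join(f"{byte:02X}" for byte in data)


for name, ch in [("A", "A"), ("acute e", "\u00e9"), ("alef", "\u05d0")]:
    print(f"{name}: {utf32_hex(ch)}")
```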

    The problem with UTF-32 is that, because the vast majority of characters in documents occur early in the UCS, almost every character in a given document will begin with two 00 bytes, which is quite a waste. Effectively, most UTF-32 documents will be four times the size of the same document encoded in a single-byte encoding like ISO-8859-1.
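
    The size penalty is easy to measure. This Python sketch compares the same text in ISO-8859-1 and big-endian UTF-32; the sample sentence and function name are arbitrary choices for the illustration.

```python
def encoded_sizes(text: str) -> tuple:
    """Return (ISO-8859-1 size, UTF-32 size) in bytes for text."""
    single_byte = text.encode("iso-8859-1")
    four_byte = text.encode("utf-32-be")  # "-be" variant adds no byte order mark
    return len(single_byte), len(four_byte)


small, large = encoded_sizes("Most English text stays within ISO-8859-1.")
print(f"ISO-8859-1: {small} bytes, UTF-32: {large} bytes, "
      f"ratio {large // small}x")
```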

    The UTF-8 and UTF-16 encodings address this by using a variable number of bytes per character. In UTF-8, the most common characters (those in the ASCII range) use only a single byte, which is equal to that character’s UCS code point, while less common characters use two, even rarer characters use three, and only the very rarest of characters use four bytes. UTF-16 accommodates a larger set of “common” characters whose two-byte encodings match their UCS code points, reserving four-byte encodings (pairs of two-byte units known as surrogate pairs) for rarer characters.

    Looking at the figure, you can see that the ‘A’ character has encodings that match its UCS code point in both UTF-8 and UTF-16. The acute ‘e’ and ‘alef’, on the other hand, are less common characters that each have a special two-byte encoding in UTF-8 that differs from its UCS code point. In UTF-16, however, both acute ‘e’ and ‘alef’ are considered common enough to get an encoding that matches their two-byte code points (00 E9 and 05 D0, respectively).
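
    If you’d like to check these byte sequences yourself, the following Python sketch prints the UTF-8 and big-endian UTF-16 bytes for the three sample characters. The fourth character, U+1F600, is not from the figure; it is added here as an assumption-free example of a code point beyond the two-byte range, where both encodings need four bytes.

```python
def hex_bytes(ch: str, encoding: str) -> str:
    """Return the bytes of ch in the given encoding as spaced hex pairs."""
    return " ".join(f"{byte:02X}" for byte in ch.encode(encoding))


samples = [
    ("A", "A"),
    ("acute e", "\u00e9"),
    ("alef", "\u05d0"),
    ("U+1F600", "\U0001F600"),  # outside the two-byte range
]

for name, ch in samples:
    print(f"{name}: UTF-8 = {hex_bytes(ch, 'utf-8')}, "
          f"UTF-16 = {hex_bytes(ch, 'utf-16-be')}")
```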

    Make sense? If you’ve followed this far, you’ve grasped all the concepts you need to work intelligently with character encodings. Keep reading to find out how all this affects your work as a web developer.

    Character Encodings and the Web

    Okay, so a character encoding specifies how a set of characters (like Unicode’s UCS, which is used on the web) can be written as bytes in a stored document. So what does this mean to web developers?

    As a web developer, there are two types of text data that you need to deal with: the text that makes up the pages of your site, and the text that is sent by your users’ browsers (usually as a form submission). In each case, you should be aware of the character encoding that is in use, and treat that data accordingly.

    It turns out that the encodings of these two bodies of text data are linked: the default encoding that a browser will use when submitting a form is governed by the encoding of the document that contained the form. A page encoded in ISO-8859-1 will submit its form data in ISO-8859-1, while a page encoded in UTF-8 will submit in UTF-8.
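
    On the server, this means you have to decode submitted form data with the same encoding as the page that contained the form. The Python sketch below uses the standard library’s urllib.parse.parse_qs; the field name and the decode_form helper are made up for the example.

```python
from urllib.parse import parse_qs


def decode_form(query_string: str, page_encoding: str) -> dict:
    """Decode URL-encoded form data using the encoding of the originating page."""
    return parse_qs(query_string, encoding=page_encoding, errors="replace")


# The same word as submitted from an ISO-8859-1 page and from a UTF-8 page.
print(decode_form("name=caf%E9", "iso-8859-1"))
print(decode_form("name=caf%C3%A9", "utf-8"))

# Decoding with the wrong encoding yields garbage rather than an error.
print(decode_form("name=caf%C3%A9", "iso-8859-1"))
```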

    So the first thing you need to do is pick an appropriate encoding in whichever editor you use to create your web documents. Depending on your editor, this will involve setting a configuration option (e.g. in Dreamweaver), or simply choosing the right encoding when you first save the file (e.g. in Notepad).

    You also need to tell browsers which encoding your documents are using. Browsers cannot reliably guess the character encoding: every document just looks like a series of byte values until an encoding is provided to interpret them. So next you must declare the character encoding of each of your documents. To indicate the encoding of an HTML document, include an appropriate <meta> tag. For ISO-8859-1:

    <meta http-equiv="Content-Type"
        content="text/html; charset=ISO-8859-1" />

    For UTF-8:

    <meta http-equiv="Content-Type"
        content="text/html; charset=UTF-8" />

    Yes, that’s right: you specify the character encoding with an attribute called charset. No wonder people find this stuff confusing!

    You might wonder how a browser can even read this tag if it doesn’t yet know the character encoding, but it turns out that most encodings in popular use have enough characters in common that the simple HTML code leading up to this tag can usually be interpreted by guessing at a simple encoding (say ISO-8859-1), and then starting over if the tag indicates the browser has guessed wrong.
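
    The guess-then-restart idea can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the algorithm any real browser uses: it scans the first 1024 bytes (an arbitrary cutoff) for a charset declaration and falls back to ISO-8859-1. All names here are hypothetical.

```python
import re

# ASCII-compatible encodings all represent this markup with the same bytes,
# so the declaration can be found before the real encoding is known.
META_CHARSET = re.compile(rb"""charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""",
                          re.IGNORECASE)


def sniff_meta_charset(raw: bytes, default: str = "iso-8859-1") -> str:
    """Look for a charset declaration near the start of an HTML document."""
    match = META_CHARSET.search(raw[:1024])
    if match is None:
        return default
    return match.group(1).decode("ascii").lower()


def decode_html(raw: bytes) -> str:
    """Decode the whole document using the declared (or default) encoding."""
    return raw.decode(sniff_meta_charset(raw))


page = ('<html><head><meta http-equiv="Content-Type" '
        'content="text/html; charset=UTF-8" /></head>'
        '<body>caf\u00e9</body></html>').encode("utf-8")

print(sniff_meta_charset(page))
print("caf\u00e9" in decode_html(page))
```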

    For CSS and JavaScript files, things are trickier. While the standards offer ways to indicate the encodings of these files, support for these is spotty. If you need to use characters outside the relatively safe ASCII character set in these files, you’ll need to configure your web server to identify the character encoding in HTTP headers that are sent with these files. For example:

    Content-Type: text/css; charset=UTF-8

    You can use the HTTP header approach for HTML documents as well, but you should still include the <meta> tag as backup in case the document is loaded without HTTP headers (e.g. it is loaded directly from the file system with a file:// URL).
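
    If you want to check what a Content-Type header actually declares, Python’s standard email.message module can parse it. This sketch is only a convenience for inspecting header values; the function name is invented, and it returns None when no charset parameter is present.

```python
from email.message import Message


def charset_from_content_type(header_value: str):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lower-cases the value and returns None if absent.
    return msg.get_content_charset()


print(charset_from_content_type("text/css; charset=UTF-8"))
print(charset_from_content_type("text/css"))
```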

    Once you’ve specified an encoding, you can verify that browsers are picking up on it. Open the page in Firefox, right-click the background and choose Page Info. The window that appears will show the character encoding that was used to interpret the document.

    [Figure: Page Info window]

    So all this raises the question: which character encoding should you be using? Well, in most cases, the answer is UTF-8. It gives you access to a multitude of characters in your documents without significantly increasing the file size, and it’s reasonably backwards-compatible with older browsers and simple devices that do not support Unicode. If, however, you need to use significant quantities of CJK (Chinese, Japanese or Korean) text, which will necessitate a larger character set, then you might find UTF-16 is a more efficient choice.

    That is, unless you’re using PHP. One of the biggest weaknesses of PHP (up to and including PHP 5.1) is that its built-in string functions handle multi-byte character encodings like UTF-8 and UTF-16 incorrectly. PHP was written with the assumption that one byte equals one character, which simply isn’t the case in such encodings. An optional module or library can be used to provide alternative string functions that do support multi-byte characters, but many of the PHP scripts in circulation use the built-in functions, and simply can’t handle Unicode characters as a result.
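
    The one-byte-equals-one-character assumption is easy to demonstrate. The sketch below is written in Python rather than PHP, purely as an analogy: it treats UTF-8 text as raw bytes, the way a byte-oriented string function would, and shows both the wrong length and the damage caused by cutting a character in half.

```python
text = "na\u00efve"               # five characters; the third is outside ASCII
raw = text.encode("utf-8")        # what a byte-oriented function would see

print(len(text))                  # number of characters
print(len(raw))                   # number of bytes: one more than above

# Taking "the first three characters" by bytes splits a two-byte sequence.
truncated = raw[:3]
print(truncated.decode("utf-8", errors="replace"))
```

    The last line prints the first two letters followed by a replacement character, because only half of the third character survived the cut.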

    This problem will be addressed in PHP 6, where Unicode support will be an integral part of the language, but in the meantime getting PHP to treat Unicode correctly is something of a black art. It’s certainly possible to do–high quality PHP scripts like WordPress and phpBB handle Unicode quite well–but you really need to know your PHP to do it.

    For this reason, PHP-based web sites are commonly written using the ISO-8859-1 encoding. SitePoint’s article and forum pages, for example, are all written using ISO-8859-1.

    As you can probably gather, using ISO-8859-1 has a few disadvantages. For one thing, you’re limited to using that relatively small character set to write your documents. What happens when you need a curly quote, or some other character not found in the ISO-8859-1 set?

    HTML’s answer to this problem is the character entity. I’m sure you’re familiar with these: codes like &rdquo; (right-hand double quotes) and &mdash; (em dash) let you include characters not available in your chosen encoding in your document’s text. For more exotic characters that do not have an easy-to-remember code in HTML, you can use numeric character references (often called numeric character entities) instead. To include the character ‘alef’ in an ISO-8859-1 document, for example, you would use either &#1488; or &#x05d0;, the decimal and hexadecimal versions of the character’s UCS code point, respectively.

    Take a moment to absorb the fact that numeric character entities refer to UCS code points for characters, not the byte values for characters in any particular encoding. The numeric character entity for ‘alef’ is the same no matter what encoding you are using in your document.
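
    You can confirm this with Python’s standard html.unescape function, which resolves named and numeric codes to characters. The sketch below shows that the decimal and hexadecimal forms both yield the character at code point 05D0, with no encoding involved.

```python
import html

alef_decimal = html.unescape("&#1488;")
alef_hex = html.unescape("&#x05d0;")

print(alef_decimal == alef_hex)          # both forms name the same character
print(f"{ord(alef_decimal):04X}")        # its UCS code point in hex

# Named codes resolve to code points in exactly the same way.
print(f"{ord(html.unescape('&rdquo;')):04X}")
print(f"{ord(html.unescape('&mdash;')):04X}")
```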

    So character entities let you deal with characters outside your selected encoding when writing documents, but what about the other side of the coin? How do you deal with characters outside a limited encoding like ISO-8859-1 when it comes to form submissions?

    Sadly, this is one place where browsers have disagreed for a long time, and even today, after much pulling of hair and gnashing of teeth, the solutions that most browsers now support are less than ideal.

    One of the biggest problems is Windows, which on English language systems makes use of a slightly modified version of ISO-8859-1 called Windows-1252. Sam Ruby has documented the differences in his survival guide. Windows-1252 represents certain useful characters like curly quotes as single bytes, taking the places of less commonly used ISO-8859-1 characters. As a result, Internet Explorer browsers will often consider such characters as being within the document encoding, and will submit them as such. On the server, these single-byte encodings get interpreted as their ISO-8859-1 equivalents, which is what often leads to ugly boxes and other nonsense characters showing up on web pages in the place of curly quotes and the like, particularly when text entered on a Windows system is displayed on a non-Windows browser like Safari.
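
    Here is a small Python sketch of that mismatch, using Python’s "cp1252" codec for Windows-1252. A curly quote is encoded as a single Windows-1252 byte, then decoded as ISO-8859-1, where the same byte value means an invisible control character.

```python
left_quote = "\u201c"                       # left curly double quote

submitted = left_quote.encode("cp1252")     # what a Windows browser may send
print(submitted.hex())                      # a single byte

as_latin1 = submitted.decode("iso-8859-1")  # what the server assumes
as_cp1252 = submitted.decode("cp1252")      # what the user actually meant

print(f"{ord(as_latin1):04X}")              # a control character, not a quote
print(as_cp1252 == left_quote)              # round-trips only as Windows-1252
```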

    That exception aside, most current browsers, when faced with a character that is not in the encoding in which a form is to be submitted, will convert that character to a numeric character entity and submit that instead. This may sound sensible at first, but consider that HTML forms are supposed to submit plain text, not HTML code. Special characters like < and > are not automatically encoded as &lt; and &gt; for submission by forms, nor should they be. This auto-conversion of out-of-encoding characters means that, in an ISO-8859-1 document, you can’t tell from the submitted form data whether the user actually typed the character ‘alef’, or the series of characters &#1488;.
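
    The ambiguity can be reproduced with Python’s "xmlcharrefreplace" error handler, which happens to perform the same substitution. This is only a mimic of the behaviour described above, not actual browser code, and the function name is hypothetical.

```python
def browser_style_submit(text: str, page_encoding: str) -> bytes:
    """Encode form text, replacing out-of-encoding characters with numeric references."""
    return text.encode(page_encoding, errors="xmlcharrefreplace")


typed_alef = browser_style_submit("\u05d0", "iso-8859-1")
typed_code = browser_style_submit("&#1488;", "iso-8859-1")

print(typed_alef)
print(typed_code)
print(typed_alef == typed_code)   # the server cannot tell the two apart

# Markup characters, by contrast, pass through untouched.
print(browser_style_submit("<b>", "iso-8859-1"))
```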

    Some browsers have approached this problem differently, replacing certain out-of-encoding characters with in-encoding equivalents (e.g. curly quotes with straight quotes), and replacing other problem characters with a generic substitute (e.g. ‘?’). While this solution is technically superior, you do miss out on the few cases where the more common approach described above manages to preserve the desired characters without any side-effects.

    A full discussion of how different browsers tackle the problem of character encoding in form submissions would take too long to go into here, but there are good writeups available for those who go looking. In short, however, your best bet for conquering these problems is to move your site to UTF-8 (or UTF-16 if appropriate) as soon as you can.

    Further Reading

    Much of the information in this issue is distilled from the second hour of a talk that Richard Ishida gave to the Melbourne Web Standards Group not long ago. If I’ve piqued your interest but you’re still a bit foggy on the details, you can listen to the complete audio of that presentation, and read through his slides, enhanced with complete tutorial notes.

    Once you start working with Unicode, you’ll find a number of utilities on Ishida’s site will come in very handy. There’s a tool for browsing the complete UCS, and another for converting between Unicode characters, code points, encodings, and numeric character entities, both of which are definitely worth bookmarking.

    Updated: The code point for the acute ‘e’ was wrong in the original version of this article.