Do you know your character encodings?

This entry reproduced from The Tech Times #134.

Last month, I attended a meeting of the Melbourne chapter of the Web Standards Group, where Richard Ishida, the Internationalization Activity Lead of the W3C, gave a remarkably clear presentation on one of the most ignored issues in web development: character encodings.

Have you ever noticed certain characters on your site not displaying the way they should? Perhaps the curly quotation marks look like little boxes, or the long dashes have been replaced with question marks. Problems like these usually arise from an incomplete understanding of character encodings on the part of the developer responsible for the site.

I’d go so far as to guess that, in English-speaking circles at least, most web developers have never learned about character encodings, and just deal with the consequences when issues like the above crop up.

As a site grows to the point where it must address an international audience (or even just an audience that likes curly quotes), however, it’s more and more difficult to ignore these issues. Even worse, in these heady times of daily hack attempts, incorrect handling of character encodings can result in severe security vulnerabilities (as Google recently discovered).

So what is a character encoding, exactly? Well, let’s start with something it’s not: a character encoding is not a character set.

Character Sets

A character set, or more specifically, a coded character set, is a set of character symbols, each of which has a unique numerical ID called the character’s code point.

Some examples of character sets include the 128-character ASCII character set, which is mostly made up of the letters, numbers, and punctuation used in the English language, and the 256-character ISO-8859-1, or Latin 1 character set, which includes all the ASCII characters plus accented and other additional characters used in related languages like French. The most expansive character set in common use is the Universal Character Set (UCS), as defined in the Unicode standard, which contains over 1.1 million code points.
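To make the idea of a code point concrete, here is a minimal Python sketch (Python is simply the language chosen for this illustration). It looks up the code points of the letter A, the acute-accented e, and the Hebrew alef, and computes the size of the Unicode code space. The helper name describe is made up for this example.

```python
import sys
import unicodedata


def describe(ch: str) -> str:
    """Return a character's code point and Unicode name (hypothetical helper)."""
    return f"U+{ord(ch):04X} ({ord(ch)}) {unicodedata.name(ch)}"


# Escapes are used so this source file itself stays pure ASCII.
for ch in ("A", "\u00e9", "\u05d0"):
    print(describe(ch))

# Unicode code points run from 0 to 0x10FFFF: a little over 1.1 million values.
CODE_SPACE_SIZE = sys.maxunicode + 1
print(CODE_SPACE_SIZE)
```

Note that nothing here says anything about bytes yet: a code point is only a number assigned to a character.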

The first thing to understand is that every HTML document uses Unicode’s UCS, or more accurately the ISO 10646 character set, which is a less involved standard describing the same set of characters. Some older browsers, or less powerful devices, may not support (and thus will not display) the complete character set, but the fact remains that any HTML document may contain any character in the UCS.

What does vary from document to document is the character encoding, which defines how each of the characters in the UCS is to be represented as one or more bytes in the text data of the page.

This figure shows the ASCII, ISO-8859-1, and Unicode code points for three characters (the letter ‘A’, the acute-accented letter ‘e’, and the Hebrew letter ‘alef’), and how those characters map to a series of bytes in five common character encodings:

Some sample character encodings

Looking first at the character sets, note how the letter ‘A’ is available as a character in all three character sets, but the acute ‘e’ isn’t available in ASCII, and ‘alef’ is only available in Unicode. Characters keep the same code points across these three character sets because ISO-8859-1 was designed as an extension of ASCII, and Unicode in turn was designed as an extension of ISO-8859-1. There are certainly other character sets where the code points of these characters, where they exist, would differ.

As I mentioned above, however, web pages always use the Unicode character set, so these code points are the only ones that matter for the purposes of web development.

Character Encodings

Now take a look at the character encodings in the figure. The first, 7-bit ASCII, dates back long before the days of MS-DOS, and is commonly used today as a "lowest common denominator" in email systems. If an email message contains only characters from the ASCII character set, and those characters are encoded as per their ASCII code points (e.g. the letter A is code point 65, which in hexadecimal (base-16) is 41, so the byte value used to represent it should be 41), then it should be compatible with any Internet email system, no matter how obsolete. Because ASCII contains only 128 code points, only seven of the eight bits in a byte are needed to represent any ASCII character. The byte values in a 7-bit ASCII document will therefore never exceed 7F (that’s 127 in base-10).
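As a rough illustration of the paragraph above, the following Python sketch encodes a short message as ASCII and checks that no byte exceeds 7F. The function name is_seven_bit is an assumption made for this example, not part of any email standard.

```python
def is_seven_bit(data: bytes) -> bool:
    """True if every byte fits in seven bits (00 to 7F hex)."""
    return all(b <= 0x7F for b in data)


message = "A plain ASCII message"
encoded = message.encode("ascii")

# The first byte is the code point of 'A' written directly as a byte value.
print(hex(encoded[0]))
print(is_seven_bit(encoded))

# A byte such as E9 (acute e in ISO-8859-1) needs the eighth bit.
print(is_seven_bit(b"\xe9"))
```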

ISO-8859-1 is the default encoding assumed by many browsers and related English-language software. It uses all eight bits of each byte to represent all 256 code points in the ISO-8859-1 character set. Though this provides the characters required for the vast majority of English language documents, as well as documents in many related languages like French, there are plenty of languages that are based on characters not included in this set. Even certain specialized characters in English documents, curly quotes and long dashes for instance, are not a part of ISO-8859-1. This explains why such characters are most often responsible for revealing a character encoding problem.
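The limits of ISO-8859-1 are easy to probe. This sketch, with a made-up helper called fits_iso_8859_1, asks Python whether a piece of text can be written in that encoding at all; the accented word can, while curly quotes and an em dash cannot.

```python
def fits_iso_8859_1(text: str) -> bool:
    """Return True if every character in text exists in ISO-8859-1."""
    try:
        text.encode("iso-8859-1")
    except UnicodeEncodeError:
        return False
    return True


# The acute e is a single byte whose value equals its code point.
print("\u00e9".encode("iso-8859-1"))

print(fits_iso_8859_1("caf\u00e9"))            # accented Latin text
print(fits_iso_8859_1("\u201cquoted\u201d"))   # curly double quotes
print(fits_iso_8859_1("\u2014"))               # em dash
```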

To serve the needs of other languages, there is an abundance of character encodings like ISO-8859-1 that make use of the possible byte values to represent a set of 256 characters. Additionally, there are a number of character encodings that use two bytes per character to allow for 65,536 different characters. Commonly used for Chinese and other languages requiring a large number of characters, these encodings are called double-byte character sets (DBCS), even though they are in fact encodings.

But for documents that may contain characters from any language, the best encodings are those that can address Unicode’s entire UCS. The simplest of these is UTF-32, which simply uses four bytes to represent each UCS character by its code point. ‘A’, which is code point 65 (41 hex) is represented by the four byte values 00 00 00 41, the acute ‘e’ (code point E9 hex) is 00 00 00 E9, and ‘alef’ (05D0 hex) is 00 00 05 D0.

The problem with UTF-32 is that, because the vast majority of characters in documents occur early in the UCS, almost every character in a given document will begin with two 00 bytes, which is quite a waste. Effectively, most UTF-32 documents will be four times the size of the same document encoded in a single-byte encoding like ISO-8859-1.
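Both the UTF-32 byte sequences and the size penalty described above can be checked with a few lines of Python. This is only a sketch: hex_bytes is a hypothetical helper, and the "utf-32-be" codec is used because it writes the bytes in the big-endian order shown in this article, without a byte order mark.

```python
def hex_bytes(text: str, encoding: str) -> str:
    """Encode text and show the resulting bytes as space-separated hex pairs."""
    return " ".join(f"{b:02X}" for b in text.encode(encoding))


# 'A', acute e, and alef: each becomes its code point padded to four bytes.
for ch in ("A", "\u00e9", "\u05d0"):
    print(hex_bytes(ch, "utf-32-be"))

# Size comparison for a run of plain English text.
sample = "Plain English text. " * 10
size_latin1 = len(sample.encode("iso-8859-1"))
size_utf32 = len(sample.encode("utf-32-be"))
print(size_latin1, size_utf32)
```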

The UTF-8 and UTF-16 encodings address this by using a variable number of bytes per character. In UTF-8, the most common characters use only a single byte, which is equal to that character’s UCS code point, while less common characters use two, even rarer characters use three, and only the very rarest of characters use four bytes. UTF-16 accommodates a larger set of "common" characters whose two-byte encodings match their UCS code points, reserving four-byte encodings (pairs of two-byte units) for rarer characters.

Looking at the figure, you can see that the ‘A’ character has encodings that match its UCS code point in both UTF-8 and UTF-16. The acute ‘e’ and ‘alef’, on the other hand, are less common characters that each have a special two-byte encoding in UTF-8 that differs from its UCS code point. In UTF-16, however, both acute ‘e’ and ‘alef’ are considered common enough to get an encoding that matches their two-byte code points (00 E9 and 05 D0, respectively).
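The byte values discussed above can be reproduced with a short Python sketch. The helper hex_bytes is hypothetical, and the fourth character (U+1F600, which lies beyond the first 65,536 code points) is an extra sample added here to show the four-byte case; it does not appear in the figure.

```python
def hex_bytes(text: str, encoding: str) -> str:
    """Encode text and show the resulting bytes as space-separated hex pairs."""
    return " ".join(f"{b:02X}" for b in text.encode(encoding))


SAMPLES = {
    "A": "A",
    "acute e": "\u00e9",
    "alef": "\u05d0",
    "U+1F600": "\U0001F600",  # added sample, outside the two-byte range
}

for label, ch in SAMPLES.items():
    utf8 = hex_bytes(ch, "utf-8")
    utf16 = hex_bytes(ch, "utf-16-be")  # big-endian, no byte order mark
    print(f"{label}: UTF-8 = {utf8} | UTF-16 = {utf16}")
```

UTF-16 never produces a three-byte character: it always works in two-byte units, one for common characters and two for the rest.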

Make sense? If you’ve followed this far, you’ve grasped all the concepts you need to work intelligently with character encodings. Keep reading to find out how all this affects your work as a web developer.

Character Encodings and the Web

Okay, so a character encoding specifies how a set of characters (like Unicode’s UCS, which is used on the web) can be written as bytes in a stored document. So what does this mean to web developers?

As a web developer, there are two types of text data that you need to deal with: the text that makes up the pages of your site, and the text that is sent by your users’ browsers (usually as a form submission). In each case, you should be aware of the character encoding that is in use, and treat that data accordingly.

It turns out that the encodings of these two bodies of text data are linked: the default encoding that a browser will use when submitting a form is governed by the encoding of the document that contained the form. A page encoded in ISO-8859-1 will submit its form data in ISO-8859-1, while a page encoded in UTF-8 will submit in UTF-8.
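To see what this means on the wire, here is a small sketch that builds a URL-encoded form body the way a browser would for a given page encoding. It uses Python's standard urlencode function; the wrapper form_body and the field name are invented for the example.

```python
from urllib.parse import urlencode


def form_body(fields: dict, page_encoding: str) -> str:
    """Build a URL-encoded form body using the encoding of the containing page."""
    return urlencode(fields, encoding=page_encoding)


fields = {"name": "caf\u00e9"}

# The same typed text produces different bytes depending on the page encoding.
print(form_body(fields, "iso-8859-1"))
print(form_body(fields, "utf-8"))
```

A server-side script that assumes the wrong one of these two encodings will misread the accented letter, which is why it pays to know which encoding your pages declare.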

So the first thing you need to do is pick an appropriate encoding in whichever editor you use to create your web documents. Depending on your editor, this will involve setting a configuration option (e.g. in Dreamweaver), or simply choosing the right encoding when you first save the file (e.g. in Notepad).

You also need to tell browsers which encoding your documents are using. Browsers cannot guess the character encoding–every document just looks like a series of byte values until an encoding is provided to interpret them. So next you must declare the character encoding of each of your documents. To indicate the encoding of an HTML document, include an appropriate <meta> tag. For ISO-8859-1:

<meta http-equiv="Content-Type"
    content="text/html; charset=ISO-8859-1" />

For UTF-8:

<meta http-equiv="Content-Type"
    content="text/html; charset=UTF-8" />

Yes, that’s right: you specify the character encoding with an attribute called charset. No wonder people find this stuff confusing!

You might wonder how a browser can even read this tag if it doesn’t yet know the character encoding, but it turns out that most encodings in popular use have enough characters in common that the simple HTML code leading up to this tag can usually be interpreted by guessing at a simple encoding (say ISO-8859-1), and then starting over if the tag indicates the browser has guessed wrong.
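The guess-then-restart idea can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the algorithm any real browser uses: the function names, the 1024-byte window, and the regular expression are all assumptions made for the example.

```python
import re

# Matches charset=... in the raw bytes; works because the tag itself is ASCII.
META_CHARSET = re.compile(rb'charset\s*=\s*["\']?([\w.:-]+)', re.IGNORECASE)


def sniff_meta_charset(raw: bytes, default: str = "iso-8859-1") -> str:
    """Look for a charset declaration near the start of the document."""
    match = META_CHARSET.search(raw[:1024])
    if match is None:
        return default  # nothing declared, so keep the initial guess
    return match.group(1).decode("ascii").lower()


def decode_page(raw: bytes) -> str:
    """Decode the whole page using the declared (or guessed) encoding."""
    return raw.decode(sniff_meta_charset(raw))


page = (
    b'<html><head><meta http-equiv="Content-Type" '
    b'content="text/html; charset=UTF-8" /></head>'
    b"<body>caf\xc3\xa9</body></html>"
)
print(sniff_meta_charset(page))
print(ascii(decode_page(page)))
```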

For CSS and JavaScript files, things are trickier. While the standards offer ways to indicate the encodings of these files, support for these is spotty. If you need to use characters outside the relatively safe ASCII character set in these files, you’ll need to configure your web server to identify the character encoding in HTTP headers that are sent with these files. For example:

Content-Type: text/css; charset=UTF-8

You can use the HTTP header approach for HTML documents as well, but you should still include the <meta> tag as backup in case the document is loaded without HTTP headers (e.g. it is loaded directly from the file system with a file:// URL).
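On the receiving end, the charset parameter of a Content-Type header can be pulled out with Python's standard email.message module, which understands this header syntax. The wrapper function below is a hypothetical convenience for the example.

```python
from email.message import Message


def charset_from_content_type(value: str, default=None):
    """Return the charset parameter of a Content-Type header value, lowercased."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset(default)


print(charset_from_content_type("text/css; charset=UTF-8"))
print(charset_from_content_type("text/css"))  # no charset declared
```

When the header carries no charset, the caller has to fall back on something else, such as the <meta> tag in an HTML document.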

Once you’ve specified an encoding, you can verify that browsers are picking up on it. Open the page in Firefox, right-click the background and choose Page Info. The window that appears will show the character encoding that was used to interpret the document.

Page Info window

So all this raises the question: which character encoding should you be using? Well, in most cases, the answer is UTF-8. It gives you access to a multitude of characters in your documents without significantly increasing the file size, and it’s reasonably backwards-compatible with older browsers and simple devices that do not support Unicode. If, however, you need to use significant quantities of CJK (Chinese, Japanese or Korean) text, most of which takes three bytes per character in UTF-8 but only two in UTF-16, then you might find UTF-16 is a more efficient choice.

That is, unless you’re using PHP. One of the biggest weaknesses of PHP (up to and including PHP 5.1) is that its built-in string functions handle multi-byte character encodings like UTF-8 and UTF-16 incorrectly. PHP was written with the assumption that one byte equals one character, which simply isn’t the case in such encodings. An optional module or library can be used to provide alternative string functions that do support multi-byte characters, but many of the PHP scripts in circulation use the built-in functions, and simply can’t handle Unicode characters as a result.
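The one-byte-equals-one-character assumption is easy to demonstrate. The sketch below is written in Python rather than PHP, so treat it as an analogy: it treats a UTF-8 string as raw bytes, the way a byte-oriented string function would, and shows both the wrong length and a truncation that cuts a character in half. The names are invented for the example.

```python
text = "caf\u00e9"               # four characters, the last one accented
utf8 = text.encode("utf-8")      # the bytes a byte-oriented function would see

char_count = len(text)
byte_count = len(utf8)
print(char_count, byte_count)


def truncate_bytes(data: bytes, limit: int) -> bytes:
    """Naive 'first N characters' that actually keeps the first N bytes."""
    return data[:limit]


# Keeping "four characters" by bytes slices the two-byte acute e in half.
broken = truncate_bytes(utf8, 4)
try:
    broken.decode("utf-8")
    survived = True
except UnicodeDecodeError:
    survived = False
print(broken, survived)
```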

This problem will be addressed in PHP 6, where Unicode support will be an integral part of the language, but in the meantime getting PHP to treat Unicode correctly is something of a black art. It’s certainly possible to do–high quality PHP scripts like WordPress and phpBB handle Unicode quite well–but you really need to know your PHP to do it.

For this reason, PHP-based web sites are commonly written using the ISO-8859-1 encoding. SitePoint’s article and forum pages, for example, are all written using ISO-8859-1.

As you can probably gather, using ISO-8859-1 has a few disadvantages. For one thing, you’re limited to using that relatively small character set to write your documents. What happens when you need a curly quote, or some other character not found in the ISO-8859-1 set?

HTML’s answer to this problem is the character entity. I’m sure you’re familiar with these: codes like &rdquo; (right-hand double quotes) and &mdash; (em dash) let you include characters not available in your chosen encoding in your document’s text. For more exotic characters that do not have an easy-to-remember code in HTML, you can use numeric character references instead. To include the character ‘alef’ in an ISO-8859-1 document, for example, you would use either &#1488; or &#x05d0;, the decimal and hexadecimal versions of the character’s UCS code point, respectively.

Take a moment to absorb the fact that numeric character references refer to UCS code points for characters, not the byte values for characters in any particular encoding. The numeric character reference for ‘alef’ is the same no matter what encoding you are using in your document.
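A short Python sketch shows that these numeric forms are derived purely from the code point. The helper numeric_references is made up for this example; html.unescape is Python's standard function for resolving such codes back into characters.

```python
import html


def numeric_references(ch: str) -> tuple:
    """Return the decimal and hexadecimal numeric forms for one character."""
    code_point = ord(ch)
    return f"&#{code_point};", f"&#x{code_point:04x};"


decimal_ref, hex_ref = numeric_references("\u05d0")  # Hebrew alef
print(decimal_ref, hex_ref)

# The markup is plain ASCII, so it survives a round trip through any of
# these encodings and always resolves to the same character.
for encoding in ("iso-8859-1", "utf-8", "utf-16-be"):
    raw = decimal_ref.encode(encoding)
    assert html.unescape(raw.decode(encoding)) == "\u05d0"
```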

So character entities let you deal with characters outside your selected encoding when writing documents, but what about the other side of the coin? How do you deal with characters outside a limited encoding like ISO-8859-1 when it comes to form submissions?

Sadly, this is one place where browsers have disagreed for a long time, and even today, after much pulling of hair and gnashing of teeth, the solutions that most browsers now support are less than ideal.

One of the biggest problems is Windows, which on English language systems makes use of a slightly modified version of ISO-8859-1 called Windows-1252. Sam Ruby has documented the differences in his survival guide. Windows-1252 represents certain useful characters like curly quotes as single bytes, taking the places of less commonly used ISO-8859-1 characters. As a result, Internet Explorer browsers will often consider such characters as being within the document encoding, and will submit them as such. On the server, these single-byte encodings get interpreted as their ISO-8859-1 equivalents, which is what often leads to ugly boxes and other nonsense characters showing up on web pages in the place of curly quotes and the like, particularly when text entered on a Windows system is displayed on a non-Windows browser like Safari.
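The mismatch described above can be reproduced directly. In this sketch, text containing curly quotes is encoded as Windows-1252 (Python calls the codec "cp1252") and then decoded two ways: correctly, and as ISO-8859-1, which is what a server assuming that encoding would effectively do.

```python
# Curly double quotes around a word, as submitted from a Windows-1252 system.
submitted = "\u201cHello\u201d".encode("cp1252")
print(submitted)

as_cp1252 = submitted.decode("cp1252")
as_latin1 = submitted.decode("iso-8859-1")

# Show the code points rather than the characters, to keep the output ASCII.
print([f"U+{ord(c):04X}" for c in as_cp1252])
print([f"U+{ord(c):04X}" for c in as_latin1])
```

Read as ISO-8859-1, the quote bytes 93 and 94 turn into the invisible control characters U+0093 and U+0094, which is why they show up as boxes or other nonsense.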

That exception aside, most current browsers, when faced with a character that is not in the encoding in which a form is to be submitted, will convert that character to a numeric character reference and submit that instead. This may sound sensible at first, but consider that HTML forms are supposed to submit plain text, not HTML code. Special characters like < and > are not automatically encoded as &lt; and &gt; for submission by forms, nor should they be. This auto-conversion of out-of-encoding characters means that, in an ISO-8859-1 document, you can’t tell from the submitted form data whether the user actually typed the character ‘alef’, or the series of characters &#1488;.
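Python's "xmlcharrefreplace" error handler happens to perform the same substitution, so it makes a convenient stand-in for the browser behaviour described above. This is a simulation under that assumption, with an invented function name, not a trace of what any particular browser sends.

```python
def submit_field(value: str, page_encoding: str = "iso-8859-1") -> bytes:
    """Encode a form value, replacing unencodable characters with numeric codes."""
    return value.encode(page_encoding, errors="xmlcharrefreplace")


typed_alef = submit_field("\u05d0")        # user typed the letter alef
typed_reference = submit_field("&#1488;")  # user typed the seven characters literally

print(typed_alef, typed_reference)
print(typed_alef == typed_reference)       # the server cannot tell them apart

# Characters that exist in the encoding, such as < and >, pass through untouched.
print(submit_field("<b>"))
```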

Some browsers have approached this problem differently, replacing certain out-of-encoding characters with in-encoding equivalents (e.g. curly quotes with straight quotes), and replacing other problem characters with a generic substitute (e.g. ‘?’). While this solution is technically superior, you do miss out on the few cases where the more common approach described above manages to preserve the desired characters without any side-effects.
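The substitution approach can be imitated as follows. The table of replacements and the function name are assumptions for illustration only; actual browsers that took this route used their own mappings.

```python
# Hypothetical table of in-encoding stand-ins for common typographic characters.
SUBSTITUTES = str.maketrans({
    "\u2018": "'",   # left single curly quote
    "\u2019": "'",   # right single curly quote
    "\u201c": '"',   # left double curly quote
    "\u201d": '"',   # right double curly quote
})


def submit_lossy(value: str, page_encoding: str = "iso-8859-1") -> bytes:
    """Swap in look-alike characters, then replace whatever still cannot be encoded."""
    return value.translate(SUBSTITUTES).encode(page_encoding, errors="replace")


print(submit_lossy("\u201cshalom\u201d \u05d0"))
```

The curly quotes come through as straight quotes and the alef becomes a question mark: the data stays plain text, but the original character is gone for good.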

A full discussion of how different browsers tackle the problem of character encoding in form submissions would take too long to go into here, but there are good writeups available for those who go looking. In short, however, your best bet for conquering these problems is to move your site to UTF-8 (or UTF-16 if appropriate) as soon as you can.

Further Reading

Much of the information in this issue is distilled from the second hour of a talk that Richard Ishida gave to the Melbourne Web Standards Group not long ago. If I’ve piqued your interest but you’re still a bit foggy on the details, you can listen to the complete audio of that presentation, and read through his slides, enhanced with complete tutorial notes.

Once you start working with Unicode, you’ll find a number of utilities on Ishida’s site will come in very handy. There’s a tool for browsing the complete UCS, and another for converting between Unicode characters, code points, encodings, and numeric character references, both of which are definitely worth bookmarking.

Updated: The code point for the acute ‘e’ was wrong in the original version of this article.

  • monty

    Thanks for this informative article, Kevin, character-encoding can be pretty confusing for most people. I’d love to see an article on how character encoding and PHP/MySQL work together, and how moving from ISO to Unicode in MySQL could impact your programming and existing apps built with PHP. Something I’m going to have to do soon with an older app running a version of MySQL that didn’t support Unicode at the time.

  • fenryr

    Great information!
    Pretty confusing but you explained it nicely!

  • andre

    hmm…the ascii code of the capital letter A is 65 decimal or 41 hex, not 41 decimal (29 hex). anyway, we get the point :)

    Kevin: Thanks, Andre! Fixed.

  • psalzmann

    Yes, I agree. Encodings can be very complicated. To combat the mysql / php / encodings issue, we simply store all special characters within MySQL as Numeric Entities. :) &#xxx;

  • http://www.xhtmlcoder.com/ xhtmlcoder

    Regarding: “…Chines, Japanese or Korean…” I assume you meant Chinese.

    Kevin: Thanks! Fixed.

  • Richard Quadling

    As someone who has only ever written in English, spoken in English, read in English, the whole character encoding/locales/character sets/etc is pretty mystifying.

    The article helped me understand a LOT of the issues.

    I’m also aware that there are XSS attacks that are based on fooling character encoding software.

    I think it would be extremely useful to have some articles describing the best practise for developers on how to deal with this.

    For example, we are often told to ALWAYS treat all external data as junk and to ALWAYS validate it. The article in php|Architect Vol 5 Issue 2, “Doing it Japanese Style” is a very good way of forcing developers down a route of validating the data.

    Validating unicode data and protecting against XSS should go hand in hand.

    But for me, who has only had to deal with english, the most complicated character I’ve had to deal with is £ (the UK currency symbol).

    So and an article would be nice.

  • http://www.phppatterns.com HarryF

    Great job Kevin!

    As someone who has only ever written in English, spoken in English, read in English, the whole character encoding/locales/character sets/etc is pretty mystifying.

    It’s also really hard to do a good job even explaining the problem, let alone solving all the issues. Part of the problem for beginners is you can’t “see” character encoding with normal editors / tools. It also gets into “bits and bytes” which you’re not normally confronted with in web app development – you really need to understand UTF-8 in terms of the 0’s and 1’s it’s made of.

    Specific to PHP, there’s some further information in wiki form available at http://www.phpwact.org/php/i18n/charsets and http://www.phpwact.org/php/i18n/utf-8 – I wrote most of those but am not yet sure whether they’re really accessible to beginners – what Kevin has done here probably makes a much better starting point.

    One other excellent but long read: http://www.cs.tut.fi/~jkorpela/chars.html

  • alles_klar

    Great article Kevin.

    This is this type of articles that draws a beginner, self-learning, PHP programmer such as myself to your blogs, forums and books. I bought a few of the latter.

    Keep up the good work.

    Harry,

    I learnt a ton from your OOP PHP books. You need to correct the following link ‘http://www.phpwact.org/php/i18n/utf-8—I’ve’

    Remove the ending part that’s part of the next sentence.

  • Anonymous

    Please note that character entities (stuff like &raquo; or &mdash;) do not work when a document is served as application/xhtml+xml (unless you define them). You can still get away with the numeric form though.

  • http://www.saumendra.com saumendra

    This is a very useful article for all who are into web development. The future scenario might be like the matrix of language support in all the websites to come: user and regional customisation.

    Thanks Kevin.

  • fibo

    I have done a site with multicharacters pages, mixing English, French, Greek, Russian. The texts come from a MySQL database.
    My own findings (lots of grey hair):
    – It is safer to have all files in UTF-8, whether PHP files, HTML code, or PHP included files.
    – Although notepad is fine to edit existing UTF-8 files, there are situations where creating UTF8 files seems not to work.
    – I once tried to develop with Nusphere’s IDE. Don’t know how the current version does, but when I checked around 1 year ago… I found that they were supporting Unicode but not UTF8, eg, it was not possible to have in the source code at the same time native French accented characters and Greek and Russian characters. OTOH, Zend’s studio does work with UTF-8 files (but frequently you have to convert the file to UTF-8 outside of Zend).
    – To securely and safely manage these character code conversions/ check, I extensively use Ultra-Edit and I could NOT have done the site without it. So, if you need to work with UTF-8 files, invest in Ultra-Edit or a similar editor for your toolbox.
    – For the site I have been doing http://www.mae.u-paris10.fr/limc-france/ I was using a pre 4.1 version of Mysql, where there is no special management for utf-8 chars. It is perfectly fine however, even though some care is needed for:
    — uploading CSV files: you need to put some “real UTF-8″ chars in at least one of the records for each column you need to be UTF-8 [and it is simpler if you decide that almost ALL your columns/ fields will be UTF8]. I had problems with columns where I had just plain ASCII and/or French accented chars: they were interpreted as single byte ISO-9951 (or Windows equivalent) and NOT as two- or multiple- byte chars
    — sorting and subsetting: I had no acces to the multibyte php functions, and used extensively utf8 encode and decode.

    A FINAL NOTE:
    At some stage when you get funny displays it becomes very difficult to understand what is happening exactly. On such occasions, I have been systematically using debugging information that displayed not only the “normal data” (assumed to be utf-8 displayed in an UTF-8 page), but also both the utf8-encode AND the utf8-decode of the same data. Usually one of the 3 was right and it was then easier to find what was happening and to track down what had happened at data-storing time.

  • http://www.dotcomwebdev.com chris ward

    This needs to be an article post on the site, I spent a whole day on the phone to a big company in the states sorting out a problem with character encoding in URLs… nightmare!

    especially when the world’s favourite browser goes the non-standards route… im not even going to waste my time getting annoyed by this in a post :) </rant>

  • http://autisticcuckoo.net/ AutisticCuckoo

    This is a very good writeup, which every web designer/developer should read.

    Sorry for nitpicking, but I spotted a couple of minor things:

    The first thing to understand is that every HTML document uses Unicode’s UCS.

    Actually, the character repertoire of HTML is ISO 10646, which is not exactly the same as Unicode.

    For more exotic characters that do not have an easy-to-remember code in HTML, you can use numeric character entities instead.

    That should be ‘numeric character references‘ (they are not entities).

  • Shauna Fjeld

    The clearest explanation of character encoding I have ever seen!
    Thanks!

  • sd1978

    Hi I am Sourabh,
    consider a variable A = abcd
    I want a code which should convert the “EACH” characters in that variable first to ASCII code.
    Then add 2 to it and then again convert it to a Character and store it in database.

    Like first it will convert ‘a’ to its respective ASCII value, then it will add 2 in it and then again it will convert that value in Character and save it.

    This process is generally called as Encryption. Kindly send me a code on it. The language is VB.NET and the database is SQL SERVER 2000.

  • DMP

    A very good article!
    I now use UTF-8 on my site and have set the charset of my ajax script to iso so the alerts windows in js works fine.

  • Anonymous

    Really nice article. Easy to read, easy to follow… I am waiting for the next one.
    ‘Rol’

  • Shripad

    This article really helped a lot. Though I have one question. I understand that the Charset Encoding for HTML page (display) and Form inside a page is controlled as one unit. What should I do if I have to submit a form using another charset ( Say ISO-8859-8). The form needs to display in WINDOWS-1252 or WINDOWS-1252.

  • PHPGuyZim|babwe

    Been struggling with encoding and ‘funny characters’ in my text for weeks now. Good read. I am working on a web crawler, php/mysql/linux. Everything now works fine, but no matter what I try, I get the weird ‘a and two diamonds’ in place of single quotes. Any ideas? My output header is utf8, the meta tag is set to charset utf8, database columns are utf8_general_ci, all php script files are saved in utf8 format. :( I’m stumped

  • Prabhat

    Good article to come out from the enigmatic world of Character Encoding. Easy to learn and grasp. The contents are presented in a very lucid way, so even a novice technical person can understand the different character encoding systems.