Character Encodings and Input

Ever run your database-driven PHP site through an HTML validator and encountered an error message such as this?

Line 9, column 3: non SGML character number 145

Even worse, have you ever run your XHTML site through an XHTML validator as XHTML and encountered an error message such as this?

Sorry, I am unable to validate this document because on line 9 it contained one or more bytes that I cannot interpret as utf-8

If so, then you have the character encoding blues.

Text formats use character encodings to map characters to their binary representation. When using only characters in the ASCII range (US English), character encodings seem to ‘just work’. You may never even be aware of the character encoding you are using. This is because the ASCII characters are represented the same way in all of the popular character encodings used on the Web, so if you never need a foreign character you won’t ever encounter a problem. However, once you deviate from this common denominator of ASCII characters and start using characters from foreign languages, their representation in binary form may depend on the character encoding used, and if you get the encoding confused you can end up with invalid characters.

The problem is, if you write your PHP application using, say, ISO-8859-1 character encoding, which is the most common with HTML, you can’t rely on all input to PHP being valid in that character encoding. Browsers routinely ignore or are unaware of the character encoding your application wants. ISO-8859-1 contains reserved values which should not be used, yet if you copy from a Word document into a Web form and submit it, the text you’ve copied may well contain Windows Code Page 1252 (Windows-1252) characters, which are invalid in ISO-8859-1.

If those characters are then shown in the CMS, the resulting pages will not validate. Or if you’re using XHTML served as XML, the page won’t display at all!

PHP, unfortunately, has no ability to convert between character encodings or to validate a string to make sure it is valid in a particular character encoding. That is, unless you enable the mbstring extension (disabled by default). The mbstring extension supports a huge number of character encodings, common and uncommon. It can convert a string from one character encoding to another, perform lots of functions on strings that would otherwise disrepect the character encoding (such as changing the case of letters) and it can even parse the input from forms for you.

If you can’t install the mbstring extension, you may need to resort to a quick fix. If you use the ISO-8859-1 encoding in your CMS, you may use the following regular expression to strip out any characters that are not valid in this encoding:

// strip out characters that aren't valid in ISO-8859-1 $string = preg_replace('/[^x09x0Ax0Dx20-x7FxC0-xFF]/', '', $string);

A better solution to this would be to use a character encoding internally that can represent any character in any other encoding. Unicode character encodings are capable of this, and UTF-8 – a Unicode character encoding that represents ASCII characters as single bytes but other characters as multiple bytes, is a good choice. Unfortunately, without mbstring or a third-party library, using UTF-8 internally is impractical. It is difficult to weed out characters that are invalid in UTF-8, or convert from other formats to UTF-8 (the utf8_encode function cannot detect or filter out invalid characters – it just assumes the input is valid ISO-8859-1).

The comments in the utf8_encode function demonstrate the problems people have with character encodings in their code.