Character Encodings and Input

Ever run your database-driven PHP site through an HTML validator and encountered an error message such as this?

Line 9, column 3: non SGML character number 145

Even worse, have you ever run your XHTML site through a validator as XHTML and encountered an error message such as this?

Sorry, I am unable to validate this document because on line 9 it contained one or more bytes that I cannot interpret as utf-8

If so, then you have the character encoding blues.

Text formats use character encodings to map characters to their binary representation. When using only characters in the ASCII range (US English), character encodings seem to ‘just work’. You may never even be aware of the character encoding you are using. This is because the ASCII characters are represented the same way in all of the popular character encodings used on the Web, so if you never need a foreign character you won’t ever encounter a problem. However, once you deviate from this common denominator of ASCII characters and start using characters from foreign languages, their representation in binary form may depend on the character encoding used, and if you get the encoding confused you can end up with invalid characters.

The problem is that if you write your PHP application using, say, the ISO-8859-1 character encoding, which is the most common with HTML, you can’t rely on all input to PHP being valid in that encoding. Browsers routinely ignore or are unaware of the character encoding your application wants. ISO-8859-1 contains reserved values (bytes 0x80 to 0x9F) which should not be used, yet if you copy from a Word document into a Web form and submit it, the text you’ve copied may well contain Windows Code Page 1252 (Windows-1252) characters, such as curly quotes, that occupy exactly those reserved values and are therefore invalid in ISO-8859-1. Character number 145 in the validator message above, for example, is the Windows-1252 left single quotation mark.

If those characters are then shown in the CMS, the resulting pages will not validate. Or if you’re using XHTML served as XML, the page won’t display at all!
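To make the clash concrete, here is a minimal sketch of how you might spot such input. The helper name has_windows1252_bytes is hypothetical, and the check only assumes that your application expects ISO-8859-1 and treats any byte from 0x80 to 0x9F as suspect.

```php
<?php
// Hypothetical helper (not part of PHP): reports whether a byte string
// contains bytes in the 0x80-0x9F range, which ISO-8859-1 reserves but
// Windows-1252 uses for printable characters such as curly quotes.
function has_windows1252_bytes(string $bytes): bool
{
    // No "u" modifier, so the pattern is matched byte by byte.
    return preg_match('/[\x80-\x9F]/', $bytes) === 1;
}

// "\x91" and "\x92" are the Windows-1252 curly single quotes (145 and 146).
$pasted = "\x91quoted\x92 text";
// "\xE9" is "é" in ISO-8859-1 and is perfectly valid there.
$plain = "caf\xE9";

var_dump(has_windows1252_bytes($pasted)); // bool(true)
var_dump(has_windows1252_bytes($plain));  // bool(false)
```

A check like this only tells you that something is wrong; deciding whether to strip, replace or convert the offending bytes is a separate choice.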

PHP, unfortunately, has no ability to convert between character encodings or to validate a string to make sure it is valid in a particular character encoding. That is, unless you enable the mbstring extension (disabled by default). The mbstring extension supports a huge number of character encodings, common and uncommon. It can convert a string from one character encoding to another, perform lots of operations on strings that the ordinary string functions would carry out without respecting the character encoding (such as changing the case of letters), and it can even parse the input from forms for you.
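As a rough illustration, and assuming the mbstring extension is loaded, validating and converting pasted text might look like the sketch below. The sample string and variable names are made up for the example, and treating the input as Windows-1252 is itself an assumption you would have to justify in a real application.

```php
<?php
// Assumes the mbstring extension is available.
// Sample input: Windows-1252 bytes for curly double quotes plus "é".
$pasted = "\x93Hello\x94 caf\xE9";

// The raw bytes are not well-formed UTF-8, so this check fails.
$isUtf8Before = mb_check_encoding($pasted, 'UTF-8');

// Convert to UTF-8, telling mbstring to read the input as Windows-1252.
$utf8 = mb_convert_encoding($pasted, 'UTF-8', 'Windows-1252');
$isUtf8After = mb_check_encoding($utf8, 'UTF-8');

// Case conversion that respects multibyte characters ("é" becomes "É").
$upper = mb_strtoupper($utf8, 'UTF-8');

var_dump($isUtf8Before); // bool(false)
var_dump($isUtf8After);  // bool(true)
echo $upper, "\n";
```

The same two calls, mb_check_encoding and mb_convert_encoding, can be applied to every request parameter before it reaches the database.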

If you can’t install the mbstring extension, you may need to resort to a quick fix. If you use the ISO-8859-1 encoding in your CMS, you may use the following regular expression to strip out any characters that are not valid in this encoding:


// strip out bytes that aren't valid in ISO-8859-1
// (keeps tab, line feed, carriage return, 0x20-0x7E and 0xA0-0xFF)
$string = preg_replace('/[^\x09\x0A\x0D\x20-\x7E\xA0-\xFF]/', '', $string);
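A variation on this quick fix, if you would rather keep readable punctuation than lose it, is to swap the most common Windows-1252 characters for ASCII stand-ins before stripping the rest. The function name downgrade_windows1252 and the choice of replacements are assumptions made for this sketch, not an established convention.

```php
<?php
// Hypothetical helper: replaces common Windows-1252 punctuation with
// ASCII look-alikes, then drops any remaining bytes in 0x80-0x9F.
function downgrade_windows1252(string $bytes): string
{
    $map = [
        "\x85" => '...', // horizontal ellipsis
        "\x91" => "'",   // left single quotation mark
        "\x92" => "'",   // right single quotation mark
        "\x93" => '"',   // left double quotation mark
        "\x94" => '"',   // right double quotation mark
        "\x96" => '-',   // en dash
        "\x97" => '-',   // em dash
    ];
    // strtr() with an array works on raw bytes, which is what we want here.
    $bytes = strtr($bytes, $map);

    // Anything else in the reserved range is removed.
    return (string) preg_replace('/[\x80-\x9F]/', '', $bytes);
}

// "\x99" (trade mark sign) has no entry in the map, so it is stripped.
echo downgrade_windows1252("\x93Hi\x94 it\x92s 5\x966\x85\x99"), "\n";
// prints: "Hi" it's 5-6...
```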

A better solution to this would be to use a character encoding internally that can represent any character in any other encoding. Unicode character encodings are capable of this, and UTF-8, a Unicode character encoding that represents ASCII characters as single bytes but other characters as multiple bytes, is a good choice. Unfortunately, without mbstring or a third-party library, using UTF-8 internally is impractical. It is difficult to weed out characters that are invalid in UTF-8, or to convert from other encodings to UTF-8 (the utf8_encode function cannot detect or filter out invalid characters; it just assumes the input is valid ISO-8859-1).
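For the validation half of that problem, one workaround that needs no extra extension is to lean on the PCRE "u" modifier, which makes preg_match fail on a subject that is not well-formed UTF-8. The sketch below assumes your PHP build has PCRE compiled with UTF-8 support; the function name is_valid_utf8 is hypothetical. It only detects bad input and does nothing to convert it.

```php
<?php
// Hypothetical helper: true when $bytes is well-formed UTF-8.
function is_valid_utf8(string $bytes): bool
{
    // With the "u" modifier PCRE validates the subject first; on malformed
    // UTF-8 preg_match() does not return 1, so the comparison is false.
    return preg_match('//u', $bytes) === 1;
}

var_dump(is_valid_utf8("caf\xC3\xA9")); // bool(true), "é" encoded as two bytes
var_dump(is_valid_utf8("caf\xE9"));     // bool(false), a lone ISO-8859-1 byte
```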

The user comments on the utf8_encode manual page demonstrate the problems people have with character encodings in their code.

  • Mike

    Huh? Never heard of iconv?

  • http://webtech.lv kaklz

    The answer is just plain simple – use UTF-8 and forget about the encodings.
    There are countries out there where you have to use more than one character set, even three of them or more.
    One of them is my country, Latvia, where you have to work with Latin, Cyrillic and Baltic character sets.
    Since I started working with UTF-8, I don’t have to care about the character sets anymore. So if you are using any character set other than Latin, I would suggest you move to UTF-8.

  • http://www.lopsica.com BerislavLopac

    I agree with kaklz above. Browsers use the encoding specified in the header to properly display HTML output and convert form entries, and databases store whatever comes from PHP; there is no need to convert strings internally. Even if you need to replace multibyte strings with other values, sprintf works like a charm.

  • http://www.phppatterns.com HarryF

    FYI, the WACT team is slowly putting together some resources on i18n:

    http://wact.sourceforge.net/docs/doku.php?id=php:i18n

    Also something on charsets, with the emphasis on UTF-8:
    http://wact.sourceforge.net/docs/doku.php?id=php:i18n:charsets

    Not complete yet and needs some heavy revision – very much in “note form” right now but getting there

  • Glen

    I have found that most encoding problems can easily be bypassed by using UTF-8 encoding for all script-generated HTML, i.e. in your <head>, include a meta tag that declares the page's charset as UTF-8.

    That way, the client will render UTF-8 correctly, and as a bonus will send back UTF-8 data from forms.

  • http://www.sitepoint.com/ mmj

    [quote=Glen]
    That way, the client will render UTF-8 correctly, and as a bonus will send back UTF-8 data from forms.
    [/quote]

    Unfortunately, for many reasons, ranging from browser bugs to browsers blatantly ignoring the spec to users changing the character encoding setting in their browser, you cannot rely on submitted data to be in any particular character encoding. If you are expecting UTF-8 encoding and you receive something else, that something else may just break your CMS, or at least ensure that it will not validate as HTML.

    Thus you will arrive at the very problem described in this blog post.

    This article contains more information about why you can’t rely on input to be in any character encoding.
    http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html

  • http://www.sitepoint.com/ mmj

    [quote=Mike]
    Huh? Never heard of iconv?
    [/quote]

    Hi Mike,
    I lumped iconv in with other third-party libraries when I said “a third-party library”. There is a PHP extension, however, which interfaces with it. The extension is an alternative to mbstring. Like mbstring, it is disabled by default in PHP and must be explicitly enabled. Unfortunately, not all PHP users are able to take advantage of these extensions, as they might not be able to compile or configure PHP themselves. However, if you can, then it would be a good solution.

  • http://www.sitepoint.com/ mmj

    [quote=HarryF]
    Also something on charsets, with the emphasis on UTF-8: http://wact.sourceforge.net/docs/doku.php?id=php:i18n:charsets
    [/quote]
    Harry,
    That article is an excellent summary of the problem. Thanks!

  • Derick

    Erm, the iconv extension IS enabled by default in PHP 5.0 and higher

  • http://blog.phpdoc.info/ scoates

    iconv has been built in to PHP since PHP 5.

  • http://www.splintered.co.uk redux

    shame that not everybody is on PHP 5 then…

  • MiiJaySung

    This is one of PHP’s biggest flaws. Multilingual character representation should not be something the programmer needs to concern themselves with here. I am so surprised that the people at Zend haven’t just said, let’s make sure all request data comes in as UTF-16 (it makes more sense to process strings in this form for speed and ease of handling with string functions), and then make the output default to UTF-8 (unless overridden by the programmer), as UTF-8 is more compact than UTF-16.

    This is much like the mistake of the magic quotes scenario. As a result, serious PHP programmers have to do countless validations and operations on request data to make sure slashes are stripped and to make sure the character set is handled correctly. This is before we start protecting ourselves from XSS issues and other validation issues.

    Oh well, maybe they will correct this in PHP 5.2. It seems like there might be some work on it for the 5.2 branch.

  • PhoebeBright

    This is probably a stupid question but I am so confused that I am going to ask it anyway.

    I have written an application that allows people to either upload a Word file (saved as HTML or text) or paste their text into a textarea, which is then uploaded to a database for later display. In my naivety I thought that was all I had to do, but funny characters started appearing. I have now written a massive program full of ifs and elses that tries to work out which character was meant and replace it, not always successfully. I can’t believe this problem has not been solved, and maybe it is just a variant of the above, but I have not been able to find a good solution anywhere. If anyone could offer any pointers I would be extremely grateful!

  • http://www.michaelkrenz.de mkrz

    @PhoebeBright: Have a look at the pages mentioned in the comments above, this might help you clarify the issue. MS Word input is usually especially difficult to handle.

  • http://www.gamersmark.com -Oz-

    I wrote a function ( http://www.sitepoint.com/forums/showthread.php?p=1864510&posted=1#post1864510 ) that I think converts most of the weird characters into &blah; which appears to be accepted by XML validation and things.

  • Anonymous

    A big thanks to the article’s author… that lil’ preg_replace saved my life!

  • Dennis

    Well, as far as I remember from my last reading of the HTTP specs, browsers will send the data back to the server in the exact encoding of the HTTP response. Moreover, in HTML 4.01 the form tag has an attribute “accept-charset” to explicitly set the encoding that the server expects.

    About mbstring: it’s a wonderful tool that will save you trouble validating multilingual strings with the mb_ereg* functions in any supported encoding, not to mention the full string handling support. Just one feature is missing: string comparison and locale-aware collation. However, this functionality has been added in the Freeform i18n package, which supports a really platform-independent locale and (from the now-under-development version 1.2.0.Beta) correct timezone support (visit http://dev6.php5.nedlinux.com/)

  • Célio Santana

    I’ve had a lot of problems with encodings. I started to use UTF-8 and the problems continue; there’s no effective way to get some data from my SQL server and show it to my users. I’m Brazilian and we use a lot of á, é, ç, and these characters aren’t treated correctly by PHP. I don’t know; I used mbstring and that doesn’t work either.

    What should I do?

  • kanga

    One of the techniques that works for just about everything regardless of encoding is the following:
    [1] Copy the line with the error from the source file
    [2] Paste into Notepad
    [3] Copy from Notepad
    [4] Paste into the source file and re-upload

    Notepad will strip the errors. I learned this on an emergency alert project I was working on: http://www.kangalert.com