Character Encoding: Issues with Cultural Integration

I’ve run into a classic charset problem in an application I’m currently working on. As is typical for PHP applications, all strings are treated as latin1, but we now need to allow a wider range of characters in a few places.

The gold-standard solution is to convert everything to utf-8. Since utf-8 covers the entire unicode range, it can represent any character that latin1 can. Unfortunately, that is a lot easier to do from the outset than in a big, running application. And even then, there may be third-party code and extensions that assume latin1. I’d much rather keep latin1 as the default, and only jump through hoops in the few places where I actually need full utf-8 capacity.

So after some thinking, another solution dawned on me. To be fair, hack is probably more descriptive than solution, but nonetheless. The idea goes as follows:

  • Use latin1 internally, but serve pages in utf-8, encoding the output on the way out.
  • Embed utf-8 strings within the latin1, and somehow leave them untouched (but still encode everything else).

Simple, eh?

Latin1 on the inside, utf-8 on the outside.

When rendering HTML pages, it is trivial to capture the output with an output buffer and pipe it through utf8_encode. The page is thus served in utf-8, even though everything internally is latin1. There is not much gain in that by itself, since it still restricts us to the range of characters covered by latin1.
We are actually already doing this, simply to reduce the number of problems for external services communicating with our system. In particular, XmlHttpRequest defaults to utf-8, regardless of the page’s encoding.

In essence, the following snippet illustrates the idea:


// declare that the output will be in utf-8
header("Content-Type: text/html; charset=utf-8");
// open an output buffer, capturing all output
ob_start('output_handler');
// when the script ends, the buffer is piped through this function, encoding it from latin1 to utf-8
function output_handler($buffer) {
  return utf8_encode($buffer);
}

Embed utf-8 within latin1.

This is the tricky part. Instead of simply piping the entire buffer through utf8_encode, the string can be parsed so anything between a set of special tags (Eg. [[charset:utf8]] ... [[/charset:utf8]]) is left as-is, while the rest is assumed to be latin1 and encoded with utf8_encode as before. This ensures full backwards compatibility, while allowing real utf-8.

Let’s modify our output-handler from before:


header("Content-Type: text/html; charset=utf-8");
ob_start('output_handler');
function output_handler($buffer) {
  return preg_replace_callback(
    // the [ and ] must be escaped, or they would start character classes;
    // the s modifier lets a tagged span contain line breaks
    '~\[\[charset:utf8\]\](.*?)\[\[/charset:utf8\]\]~s',
    'utf8_decode_first',
    utf8_encode($buffer));
}
function utf8_decode_first($match) {
  return utf8_decode($match[1]);
}

And that’s it. We can now embed full utf-8 strings within our otherwise latin1-encoded application, by wrapping them in [[charset:utf8]] tags. To make things a bit more readable, I added a helper function:


function utf8($utf8_encoded_byte_stream) {
  return '[[charset:utf8]]' . $utf8_encoded_byte_stream . '[[/charset:utf8]]';
}

And we can now construct a string as simply as:


// the first literal holds the utf-8 bytes for "blåbær", shown as they appear in a latin1 editor
echo utf8("blÃ¥bÃ¦r") . "grød";

To produce the output: blåbærgrød

note: As pointed out by Kore in the comments, it would be a problem if the delimiter itself (e.g. [[charset:utf8]]) appears in the data. To remedy this, it would be safer to use a more unique delimiter. You could simply replace charset:utf8 with something that is unlikely to ever occur in real data. It’s still not completely bulletproof, but it’s good enough for most practical uses.
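As a sketch of that remedy (the token below is something I made up; any string unlikely to occur in real data will do), the delimiter can be defined in a single place, so it is trivial to swap out later:

```php
<?php
// Hypothetical sketch: keep the delimiter in one place, so it can be
// replaced if it ever collides with real data. The token is arbitrary.
define('UTF8_OPEN',  '[[x-7f3a9-charset:utf8]]');
define('UTF8_CLOSE', '[[/x-7f3a9-charset:utf8]]');

function utf8($utf8_encoded_byte_stream) {
  return UTF8_OPEN . $utf8_encoded_byte_stream . UTF8_CLOSE;
}

function output_handler($buffer) {
  // preg_quote escapes the [ and ] metacharacters in the delimiters
  $pattern = '~' . preg_quote(UTF8_OPEN, '~') . '(.*?)'
           . preg_quote(UTF8_CLOSE, '~') . '~s';
  return preg_replace_callback($pattern, 'utf8_decode_first', utf8_encode($buffer));
}

function utf8_decode_first($match) {
  return utf8_decode($match[1]);
}
```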

Handling input.

You may or may not know this, but when submitting a form, browsers send back data in the same encoding as the page was served in. Since our application is predominantly latin1, we need user input to be latin1, to keep BC. So all input must be decoded from utf-8 to latin1. This is simple enough; we just have to pipe all user input ($_GET, $_POST etc.) through utf8_decode. Since we already run with the latin1-on-the-inside-utf-8-on-the-outside scheme, this was already in place in our case.
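A minimal sketch of that decoding step, assuming the application reads the superglobals directly (nested field names such as items[0][name] produce nested arrays, so the decode has to recurse):

```php
<?php
// Sketch: recursively decode incoming utf-8 request data to latin1,
// so legacy code keeps seeing the encoding it expects.
function decode_input_deep($value) {
  if (is_array($value)) {
    return array_map('decode_input_deep', $value);
  }
  return utf8_decode($value);
}

$_GET  = decode_input_deep($_GET);
$_POST = decode_input_deep($_POST);
```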

This does, however, pose a problem when the user needs to submit utf-8, as our users would need when replying to mails. So in these places, we would have to explicitly access the “raw” string through an alternate mechanism. In our case, we needed to modify our http-request wrapper, but since this is extending the API, there are no BC problems.

With the advent of PHP6, perhaps such hacks won’t be necessary in the future, but for now this gives a working, unobtrusive solution.


  • http://autisticcuckoo.net/ AutisticCuckoo

    You may or may not know this, but when submitting a form, browsers send back data in the same encoding as the page was served.

    This is the default behaviour if the accept-charset attribute is omitted from the <form> tag. By using accept-charset="utf-8" you can instruct browsers to send the data encoded with UTF-8. As far as I know, browser support for this is quite decent.
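A sketch of a form using the attribute AutisticCuckoo describes (the action URL and field names are just placeholders):

```php
<?php
// A form that always submits utf-8, regardless of the page's own encoding.
$form = '<form action="reply.php" method="post" accept-charset="utf-8">'
      . '<textarea name="body"></textarea>'
      . '<input type="submit" value="Send">'
      . '</form>';
echo $form;
```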

  • kore

    Hi,

a) Strings in PHP do not have any charset or encoding information associated. They are just binary, as described here.

b) Why do you want to convert to Latin1 anyway? It should only be relevant if you need to process strings character-wise, which should not be necessary in “normal” applications. If you need to do so, take a look here.

c) For charset and encoding conversions you might want to use the iconv() functions. iconv() not only handles more encodings, but can also handle character-set incompatibilities between encodings (transliteration, ignore).

d) The charsets/encodings browsers send can be influenced either by the encoding information given in the Content-Type headers (HTTP, HTML meta tags) or by the form attribute already mentioned in another comment. Not to mention that all clients may, of course, send garbage, which needs to be sanitized.

    e) You should not try to parse such potentially recursive markup with regular expressions – and even using it in examples encourages people to follow this example. This will never work.

    You just should ensure you use the same encoding throughout your application (mind the backend connections) and there won’t be any real problems.
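kore’s point (c) can be sketched like this; the exact //TRANSLIT replacement for the € sign depends on the underlying iconv implementation, so no result is shown for it:

```php
<?php
// iconv() converts between encodings; the //TRANSLIT and //IGNORE suffixes
// control what happens to characters the target charset cannot represent.
$latin1   = iconv('UTF-8', 'ISO-8859-1', "bl\xC3\xA5");                 // plain conversion of "blå"
$translit = iconv('UTF-8', 'ISO-8859-1//TRANSLIT', "5 \xE2\x82\xAC");   // approximate the €
$ignored  = iconv('UTF-8', 'ISO-8859-1//IGNORE', "5 \xE2\x82\xAC");     // drop the €
```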

  • http://autisticcuckoo.net/ AutisticCuckoo

    Strings in PHP do not have any charset or encoding information associated. They are just binary, like described here.

LOL. The binary representation of characters is exactly what an encoding is. The problem with PHP is that the string functions in the standard library assume one byte per character. There are multi-byte string functions available, but then you have to choose which encoding to use.

    Java and JavaScript, on the other hand, internally use 16 bits to represent each character. That means they can at least handle the BMP (basic multilingual plane) in Unicode.
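The one-byte-per-character assumption AutisticCuckoo mentions is easy to demonstrate (this assumes the mbstring extension is loaded):

```php
<?php
// strlen() counts bytes; mb_strlen() counts characters once told the encoding.
$s = "bl\xC3\xA5b\xC3\xA6r";        // "blåbær" as utf-8: 6 characters, 8 bytes
$bytes = strlen($s);                // 8 - the single-byte view
$chars = mb_strlen($s, 'UTF-8');    // 6 - the character view
```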

  • kore

    AutisticCuckoo:

An encoding maps the characters of a specific character set to a sequence of bytes. Of course strings in PHP (<6) may contain text in any encoding and any character set. But there is no character set associated, and all string functions in PHP (<6) just operate on bytes – which actually does not differ much from characters in single-byte encodings. But there is *no* information whether it is Latin1, ISO-8859-*, or similar.

  • Tom

Converting UTF-8 to Latin-1 means that you lose all chars not in Latin-1. That way you’re not able to handle cyrillic, greek, chinese or … chars.

I suggest using UTF-8 internally. If you’re fetching data from old parts of the application that still use Latin-1, you can convert them to UTF-8 without losing information.

    On input you can check for the used charset and convert this to UTF-8, too.

  • Troels Knak-Nielsen

    By using accept-charset=”utf-8″ you can instruct browsers to send the data encoded with UTF-8. As far as I know, browser support for this is quite decent.

    Yes. In this case, it’s redundant though, since browsers are also pretty consistent in sending back in the same encoding as they receive. It wouldn’t hurt though, and it’s better to be safe than sorry.

    You just should ensure you use the same encoding throughout your application (mind the backend connections) and there won’t be any real problems.

You’re absolutely right. The premise, however, is that I have a legacy application in latin1 and now need to use utf-8. Porting the entire application to utf-8 is a major undertaking, so it’s out of the question. I suspect that a lot of people are in a similar situation. What I described is a technique for coping with an imperfect world.

    You should not try to parse such potentially recursive markup with regular expressions – and even using it in examples encourages people to follow this example.

Good point. If someone actually uses the delimiter as part of the data, the parser would choke on it. This risk can be reduced by choosing a more unique delimiter, but it can never be eliminated completely. With a sufficiently unique identifier the risk is very low, so I think I’ll be bold and brush this off as an academic issue. Thanks for pointing it out though. If the problem does arise, there is a single place in the application where the delimiter can be changed to something better than charset:utf8, which – arguably – isn’t very unique.

  • Troels Knak-Nielsen

    Converting UTF-8 to Latin-1 means that you loose all chars not in Latin-1. That way you’re not able to handle cyrillic, greek, chinese or … chars.

I’m not sure if this comment was meant for me, but in that case I think you missed the whole point. The idea of embedding utf-8 encoded strings within a sea of latin1 is exactly to preserve the full unicode range of those strings. There is no conversion to latin1 in this recipe.

  • Tom

    “So all input must be decoded from utf-8 to latin1.” – Means that you are not able to get input in other charsets.

  • Troels Knak-Nielsen

    And that’s why, in the following paragraph, I say: “So in these places, we would have to explicitly access the ‘raw’ string, through an alternate mechanism.”.

Perhaps I wasn’t being clear about what this meant, so let me illustrate with a concrete example. Assume that your legacy, latin1-only application uses the superglobals (e.g. $_GET, $_POST etc.) directly. To assure BC, you would have to stick something like this at the top of your script:

$_POST = array_map('utf8_decode', $_POST);

    Obviously the above implementation is naïve, but I hope it conveys my point.

    Now, since we’re adding a piece of UTF-8 aware code into the application, we need this to be able to retrieve the raw, un-decoded input in those places. So let’s add this:

$GLOBALS['POST_UTF8'] = $_POST;
$_POST = array_map('utf8_decode', $_POST);

    Now, the UTF-8 aware code can use $GLOBALS['POST_UTF8'], while still keeping full BC with the legacy code (Since $_POST will only contain latin1).