By Harry Fuecks

Hot PHP UTF-8 tips

As a result of all the noise about UTF-8, got an email from Marek Gayer with some very smart tips on handling UTF-8. What follows is a discussion illustrating what happens when you get obsessed with performance and optimizations (be warned – may be boring, depending on your perspective).

Outrunning mbstring case functions with native PHP implementations

The native PHP strtolower / strtoupper functions don’t understand UTF-8 – they can only handle characters in the ASCII range, plus they may examine your server’s locale setting for further character information. The latter behaviour actually makes them “dangerous” to use on a UTF-8 string, because there’s a chance that strtolower could mistake bytes in a UTF-8 multi-byte sequence for something it should convert to lowercase, breaking the encoding. That shouldn’t be a problem if you’re writing code for a server you control, but it is if you’re writing software for other people to use.

Restricting locale behaviour

Turns out you can disable this locale behaviour by restricting your locale to the POSIX locale, which means only characters in the ASCII range will be considered (overriding whatever your server’s locale settings are), by executing the following;


<?php
setlocale(LC_CTYPE, 'C');

That should work on any platform (certainly *Nix-based and Windows) and affects more than just strtolower() / strtoupper() – other PHP functionality picks up information from the locale, such as the PCRE \w meta character, strcasecmp() and ucfirst(), all of which might have adverse effects on UTF-8.

The only issue, as I see it, is if you’re writing distributable software: should you be messing with setlocale in the first place? See the warning in the documentation here – it can be a problem on Windows, where you have only a single server process – you may be affecting other apps running on the server.
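If you’d rather not change the locale for the whole script, one option is to switch LC_CTYPE only around the calls that need it and then put the old value back. What follows is a minimal sketch, not something from Marek’s code: the helper name sketch_with_c_ctype() is made up for illustration, and on a threaded server other threads could still see the C locale while the callback runs.

```php
<?php
// Hypothetical helper (not from the article): run $fn with LC_CTYPE forced
// to the C locale, then restore whatever was set before.
function sketch_with_c_ctype($fn, array $args = array())
{
    // Passing '0' as the locale only queries the current setting.
    $previous = setlocale(LC_CTYPE, '0');
    setlocale(LC_CTYPE, 'C');
    try {
        return call_user_func_array($fn, $args);
    } finally {
        // Restore the earlier locale even if the callback throws.
        if ($previous !== false) {
            setlocale(LC_CTYPE, $previous);
        }
    }
}

// Usage: the two bytes of "É" (0xC3 0x89) are left alone, "COLE" is lowered.
$lowered = sketch_with_c_ctype('strtolower', array("\xC3\x89COLE"));
```

Restoring in a finally block keeps the change scoped to the call, which seems the polite thing to do in code other people will install.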

Fast Case Conversion

To make it possible to do case conversion (e.g. strtolower/upper) without depending on mbstring (because who knows if shared hosts have installed it?), applications like Mediawiki (as in Wikipedia) and Dokuwiki implement pure-PHP versions of these functions using lookup arrays like this or this ($UTF8_LOWER_TO_UPPER variable towards end of the script). That works because only a limited selection of alphabets have the notion of case in the first place – the array is big but not sooo big that it’s a terrible performance overhead. What’s interesting to note about both those lookup arrays is they contain characters in the ASCII range. They also support many alphabets.

Mediawiki then (essentially) does a str_to_upper like this (at least in the 1.7.1 release – see languages/LanguageUtf8.php – this seems to have changed since under SVN);



        // ... bunch of stuff removed
        // Note: the /e (eval) modifier used here was removed in PHP 7.
        return preg_replace( "/$x([a-z]|[\xc0-\xff][\x80-\xbf]*)/e",
              "strtr( \"\$1\" , \$wikiUpperChars )",
              $str
        );


…it’s locating each lowercase ASCII letter or UTF-8 multi-byte sequence and running PHP’s strtr() on it with the lookup array, via a callback (the /e pattern modifier – time to phone a friend?), to convert the case. That keeps memory use minimal, traded against performance (probably – not benchmarked): many callbacks / evals.
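For anyone who wants to try the same idea without the /e modifier, here is a rough equivalent using preg_replace_callback(). It is a sketch only: sketch_uc() and the tiny $upperChars table are stand-ins I’ve made up for Mediawiki’s function and its $wikiUpperChars array, and the real table is far larger.

```php
<?php
// Hypothetical, tiny stand-in for Mediawiki's $wikiUpperChars lookup table.
$upperChars = array_combine(range('a', 'z'), range('A', 'Z'));
$upperChars["\xC3\xA9"] = "\xC3\x89"; // é => É
$upperChars["\xC3\xB1"] = "\xC3\x91"; // ñ => Ñ

// Same pattern as the Mediawiki snippet, with a closure instead of /e.
// $firstOnly anchors the match at the start, which gives ucfirst behaviour.
function sketch_uc($str, array $table, $firstOnly = false)
{
    $anchor = $firstOnly ? '^' : '';
    return preg_replace_callback(
        '/' . $anchor . '([a-z]|[\xc0-\xff][\x80-\xbf]*)/',
        function (array $m) use ($table) {
            // Each match is one ASCII letter or one multi-byte sequence.
            return strtr($m[1], $table);
        },
        $str
    );
}

$shout = sketch_uc("caf\xC3\xA9", $upperChars);
$title = sketch_uc("\xC3\xA9t\xC3\xA9", $upperChars, true);
```

The pattern is matched byte-wise (no /u modifier), so it doesn’t validate the UTF-8; it only picks out a lead byte plus its continuation bytes.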

Dokuwiki (and phputf8) uses a similar approach but first splits the input string into an array of UTF-8 sequences and sees if they match in the lookup array. This is PHP UTF-8’s implementation, which is almost the same (utf8_to_unicode() converts a UTF-8 string to an array of sequences, representing characters, and utf8_from_unicode() does the reverse);


function utf8_strtolower($string){
    global $UTF8_UPPER_TO_LOWER;
    
    $uni = utf8_to_unicode($string);
    
    if ( !$uni ) {
        return FALSE;
    }
    
    $cnt = count($uni);
    for ($i=0; $i < $cnt; $i++){
        if ( isset($UTF8_UPPER_TO_LOWER[$uni[$i]]) ) {
            $uni[$i] = $UTF8_UPPER_TO_LOWER[$uni[$i]];
        }
    }
    
    return utf8_from_unicode($uni);
}

That’s going to use more memory for a short period, given that it copies the input string into an array (actually that needs fixing!), and an array needs more space than a string to store the equivalent information, but it should be faster.

Anyway – enter Marek’s approach which can be summarized as;


function StrToLower ($s)  {
     global $TabToLower;
     return strtr (strtolower ($s), $TabToLower);
}

… where $TabToLower is the lookup table (now minus the ASCII character lookups, which are handled by strtolower). Note the code Marek showed me uses classes – this is just a simplification. It relies on the POSIX locale being set (otherwise the UTF-8 encoding might get broken) and exploits a facet of UTF-8’s design, namely that any complete sequence in a valid UTF-8 string is unique (it can’t be mistaken for part of a longer sequence). You also need to read the strtr() documentation very carefully…

strtr() may be called with only two arguments. If called with two arguments it behaves in a new way: from then has to be an array that contains string -> string pairs that will be replaced in the source string. strtr() will always look for the longest possible match first and will *NOT* try to replace stuff that it has already worked on.
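To see Marek’s approach run end to end, here is a small self-contained sketch. The three-entry $tabToLower table and the function name sketch_strtolower() are mine, for illustration only; a real table would cover every cased character you care about.

```php
<?php
// Restrict ctype handling to ASCII so strtolower() can't touch UTF-8 bytes.
setlocale(LC_CTYPE, 'C');

// Hypothetical, deliberately tiny lookup table (non-ASCII entries only;
// the ASCII range is left to strtolower()).
$tabToLower = array(
    "\xC3\x89" => "\xC3\xA9", // É => é
    "\xC3\x91" => "\xC3\xB1", // Ñ => ñ
    "\xC5\xA0" => "\xC5\xA1", // Š => š
);

function sketch_strtolower($s, array $table)
{
    // strtolower() lowers A-Z; the array form of strtr() then swaps each
    // complete multi-byte sequence it finds in the table.
    return strtr(strtolower($s), $table);
}

$lower = sketch_strtolower("\xC3\x89COLE \xC5\xA0KODA", $tabToLower);
```

Because no complete UTF-8 sequence can appear inside another character’s bytes, strtr() can’t accidentally match across a character boundary in valid input.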

I’ve yet to benchmark this but Marek tells me he’s found it to be roughly three times faster than the equivalent mbstring functions, which I can believe.

Marek also employs some smart tricks for handling the lookup arrays. Both the dokuwiki and mediawiki approaches have all possible case conversions defined – i.e. they apply to multiple human languages. While this may be appropriate for user submitted content, when you’re doing stuff like localizations of your UI, chances are you’ll only be using a single language – you don’t need the full lookup table, just the entries applicable to the language involved, assuming you know what those are. Also you might think about looking at the incoming $_SERVER['HTTP_ACCEPT_LANGUAGE'] from the browser.
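As a rough illustration of the per-language idea (this is my guess at the shape, not Marek’s code), you could keep one small table per language and pick one from the Accept-Language header. The $caseTables array and sketch_pick_language() are hypothetical, and the parsing is deliberately naive: it ignores q-values and simply takes the first supported language in header order.

```php
<?php
// Hypothetical per-language upper => lower tables (a few entries each).
$caseTables = array(
    'fr' => array(
        "\xC3\x89" => "\xC3\xA9", // É => é
        "\xC3\x80" => "\xC3\xA0", // À => à
        "\xC3\x87" => "\xC3\xA7", // Ç => ç
    ),
    'cs' => array(
        "\xC5\xA0" => "\xC5\xA1", // Š => š
        "\xC4\x8C" => "\xC4\x8D", // Č => č
    ),
);

// Return the first supported primary language tag found in the header,
// or $default when nothing matches. q-values are ignored in this sketch.
function sketch_pick_language($header, array $supported, $default)
{
    foreach (explode(',', $header) as $item) {
        $parts   = explode(';', $item);          // drop any ";q=..." part
        $tag     = strtolower(trim($parts[0]));  // e.g. "cs-cz"
        $pieces  = explode('-', $tag);
        $primary = $pieces[0];                   // e.g. "cs"
        if (in_array($primary, $supported, true)) {
            return $primary;
        }
    }
    return $default;
}

$header = isset($_SERVER['HTTP_ACCEPT_LANGUAGE']) ? $_SERVER['HTTP_ACCEPT_LANGUAGE'] : '';
$lang   = sketch_pick_language($header, array_keys($caseTables), 'fr');
$table  = $caseTables[$lang]; // pass this to your strtr()-based conversion
```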

Anyway – when I get some time, will figure out how to use Marek’s ideas in PHP UTF-8.

Output Conversion

Another smart tip from Marek, which I haven’t seen discussed before, is how to deliver content to clients that can’t deal with UTF-8 e.g. old browsers, phones(?). His approach is simple and effective – once you’ve finished building the output page, capture it in an output buffer, check what the client sent as acceptable character sets ($_SERVER['HTTP_ACCEPT_CHARSET']) and convert (downgrade) the output with iconv if necessary.

You need to be careful examining the content of that header and processing it correctly. You also need to make sure you’ve redeclared the Content-Type charset plus any charset declared in an HTML meta tag or the encoding in an XML declaration. But this is certainly the serious / accessible way to solve the problem in PHP.
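Here is a minimal sketch of that flow, under some loud assumptions: the function names are made up, the header parsing ignores q-values, only charsets from a whitelist you supply are considered, and nothing here rewrites a meta tag or XML declaration inside the page. If iconv can’t represent a character in the target charset, the sketch gives up and stays with UTF-8.

```php
<?php
// Decide which charset to send. UTF-8 is kept when the header is missing,
// lists utf-8 or contains a wildcard. $fallbacks is a hypothetical whitelist.
function sketch_pick_charset($acceptCharset, array $fallbacks)
{
    $accepted = array();
    foreach (explode(',', strtolower($acceptCharset)) as $item) {
        $parts      = explode(';', $item); // drop any ";q=..." part
        $accepted[] = trim($parts[0]);
    }
    if (trim($acceptCharset) === ''
        || in_array('utf-8', $accepted, true)
        || in_array('*', $accepted, true)) {
        return 'UTF-8';
    }
    foreach ($fallbacks as $charset) {
        if (in_array(strtolower($charset), $accepted, true)) {
            return $charset;
        }
    }
    return 'UTF-8';
}

// Convert the finished page; returns array(body, charset actually used).
function sketch_convert_page($page, $charset)
{
    if ($charset === 'UTF-8') {
        return array($page, 'UTF-8');
    }
    $converted = @iconv('UTF-8', $charset, $page);
    if ($converted === false) {
        return array($page, 'UTF-8'); // conversion failed, keep UTF-8
    }
    return array($converted, $charset);
}

// Capture, convert, declare the charset, then send.
function sketch_send_page($buildPage)
{
    ob_start();
    call_user_func($buildPage); // echoes the page in UTF-8
    $page   = ob_get_clean();
    $header = isset($_SERVER['HTTP_ACCEPT_CHARSET']) ? $_SERVER['HTTP_ACCEPT_CHARSET'] : '';
    list($body, $used) = sketch_convert_page(
        $page,
        sketch_pick_charset($header, array('ISO-8859-1'))
    );
    header('Content-Type: text/html; charset=' . $used);
    echo $body;
}
```

Returning the charset actually used alongside the body keeps the Content-Type header honest when the conversion has to be abandoned.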

Moral of the story…

…is it’s worth talking to people who actually need UTF-8, vs. those in countries complacently using ISO-8859-1 (which doesn’t natively support the Euro symbol BTW!).

Given that Mediawiki has “done” Unicode Normalization in PHP (here), the only remaining piece of the puzzle is Unicode Collation (e.g. for sorting) – here’s a nice place for inspiration. After that – who needs PHP 6 ;)
