PHP UTF-8 0.1 — SitePoint

Been messing around with bits of this code for a long time, in fact since first really getting to grips with Dokuwiki, but finally got the first release out.

PHP UTF-8 is intended to make it possible to handle UTF-8 encoded strings in PHP, without requiring the mbstring extension (although it uses mbstring if it’s available). In short, it provides versions of PHP’s string functions (pretty much everything you’ll find on this list), prefixed with utf_ and aware of UTF-8 encoding (that is, that one character can take up more than one byte). It also gives you some tools to check UTF-8 strings for “well formedness” and strip bad sequences, plus some “ASCII helpers”.
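To make the "one character can be more than one byte" point concrete, here is a minimal sketch of what a UTF-8 aware length and substring have to do when mbstring is not around. The sketch_ function names are made up for illustration and are not the library's own API; the approach (leaning on PCRE's /u modifier) is just one way to do it.

```php
<?php
// Hypothetical helpers for illustration only; not the library's functions.

// Count characters (code points) rather than bytes.
function sketch_utf8_strlen(string $str): int
{
    // With /u, "." matches one whole code point; /s lets it match newlines too.
    $count = preg_match_all('/./us', $str, $matches);
    if ($count === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $count;
}

// Take $length characters starting at character offset $start.
function sketch_utf8_substr(string $str, int $start, int $length): string
{
    // Split into an array of characters, then slice by character position.
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return implode('', array_slice($chars, $start, $length));
}

$word = "h\xC3\xA9llo"; // "héllo": the é is stored as two bytes

var_dump(strlen($word));                              // int(6), counts bytes
var_dump(sketch_utf8_strlen($word));                  // int(5), counts characters
var_dump(substr($word, 0, 2) === "h\xC3");            // bool(true), é cut in half
var_dump(sketch_utf8_substr($word, 0, 2) === "h\xC3\xA9"); // bool(true), é intact
```

Splitting into an array is simple but not cheap on long strings, which is part of why you only want this sort of function where character positions really matter.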

Parts of the code are cannibalised from elsewhere – thanks to Andi Gohr (Dokuwiki UTF-8) and Henri Sivonen for his UTF-8 to Code Point Array Converter (which was ported to PHP from the Mozilla codebase).

You’ll have to forgive a little pride but, from initial benchmarks on the most critical native (non-mbstring) functions, performance is almost as good as the mbstring functions – within the acceptable range. The key was this inspired tip. Otherwise it’s a matter of putting the /u (PCRE_UTF8) pattern modifier to good use.
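For anyone who hasn't met the /u modifier, a small sketch of what it changes. This is not code from the library, and sketch_is_valid_utf8 is a hypothetical name; it only shows the two behaviours that make /u useful here: the pattern and subject are treated as UTF-8, and PCRE refuses malformed input.

```php
<?php
$e = "\xC3\xA9"; // "é" encoded as two bytes

// Without /u, "." matches a single byte, so a two-byte character
// does not fit the pattern "exactly one of anything".
var_dump(preg_match('/^.$/', $e));  // int(0)

// With /u, "." matches one whole UTF-8 character.
var_dump(preg_match('/^.$/u', $e)); // int(1)

// Hypothetical helper: PCRE validates the subject when /u is set, so an
// empty pattern returns 1 for well-formed input and false for malformed input.
function sketch_is_valid_utf8(string $str): bool
{
    return preg_match('//u', $str) === 1;
}

var_dump(sketch_is_valid_utf8("caf\xC3\xA9")); // bool(true)
var_dump(sketch_is_valid_utf8("caf\xE9"));     // bool(false), a stray Latin-1 byte
```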

Anyway – documentation is thin on the ground, apart from inline in the source code – need to do a tutorial eventually. For a “from scratch” lead in, best places to go are the charsets and PHP utf-8 pages of the WACT wiki.

A big warning if you plan to use this – do not “blindly” replace all use of PHP’s string functions with functions from this library. Most of the functions it provides you will only ever need occasionally, if at all, and although performance is “acceptable”, it’s not as fast as the real str* thing. The two key things you must have clear in your mind are that ASCII-7 (US-ASCII) is a subset of UTF-8 and that each valid UTF-8 sequence is unique (it cannot be mistaken for a subsequence of another, longer sequence). It’s worth spending a long time looking at the table on this page.
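Those two properties are why a lot of PHP's native functions are already safe on UTF-8 strings. A short sketch, using only built-in functions and sample data invented for the example:

```php
<?php
// "Zürich,København,Paris" with the non-ASCII characters spelled out as bytes.
$csv = "Z\xC3\xBCrich,K\xC3\xB8benhavn,Paris";

// Safe: the comma is ASCII, and an ASCII byte never appears inside a
// multi-byte UTF-8 sequence, so explode() cannot split a character in half.
$cities = explode(',', $csv);

// Safe: a valid UTF-8 needle can only match on a real character boundary,
// so a byte-wise str_replace() swaps whole characters.
$fixed = str_replace("\xC3\xB8", "o", $cities[1]);
var_dump($fixed); // string(9) "Kobenhavn"

// Safe to search, but note what comes back: strpos() returns a BYTE offset.
// "rich" starts at byte 3 because the ü before it occupies two bytes,
// even though it is the third character (character offset 2).
$bytePos = strpos($cities[0], "rich");
var_dump($bytePos); // int(3)
```

The native functions only get you into trouble once you care about character positions or lengths, or when the operation itself needs to know what a character is; those are the occasional cases where a UTF-8 aware version earns its keep.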

Anyway – hopefully it helps make mbstring-independent PHP applications darn near possible. Failing that, PHP6 may be around by the end of the year.