Been messing around with bits of this code for a long time, in fact since first really getting to grips with Dokuwiki, but finally got the first release out.
PHP UTF-8 is intended to make it possible to handle UTF-8 encoded strings in PHP without requiring the mbstring extension (although it uses mbstring if it’s available). In short, it provides versions of PHP’s string functions (pretty much everything you’ll find on this list), prefixed with utf8_ and aware of UTF-8 encoding (where 1 character >= 1 byte, i.e. a single character can take more than one byte). It also gives you some tools to check UTF-8 strings for “well formedness” and strip bad sequences, plus some “ASCII helpers”.
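To make the "1 character >= 1 byte" point concrete, here is a minimal sketch of the difference between counting bytes and counting characters. The function name sketch_utf8_strlen is made up for illustration and is not the library's own implementation; it assumes PHP 7 or later and input that is already valid UTF-8.

```php
<?php
// Hypothetical helper (not taken from the library): counts characters by
// counting every byte that is NOT a UTF-8 continuation byte (10xxxxxx).
// Assumes $str is already well-formed UTF-8.
function sketch_utf8_strlen(string $str): int
{
    $count = 0;
    $bytes = strlen($str); // native strlen() counts bytes
    for ($i = 0; $i < $bytes; $i++) {
        if ((ord($str[$i]) & 0xC0) !== 0x80) {
            $count++;
        }
    }
    return $count;
}

// "naïve" written with explicit bytes so the file encoding does not matter:
// the i with diaeresis is the two-byte sequence C3 AF.
$word = "na\xC3\xAFve";

echo strlen($word), "\n";             // 6 (bytes)
echo sketch_utf8_strlen($word), "\n"; // 5 (characters)
```

For plain ASCII input the two counts are the same, which is why the difference is easy to miss until non-ASCII text turns up.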
Parts of the code are cannibalised from elsewhere – thanks to Andi Gohr (Dokuwiki UTF-8) and Henri Sivonen for his UTF-8 to Code Point Array Converter (which was ported to PHP from the Mozilla codebase).
You’ll have to forgive a little pride but, from initial benchmarks on the most critical native (non-mbstring) functions, performance is almost as good as the mbstring functions, and within the acceptable range. The key was this inspired tip. Otherwise it’s a matter of putting the /u (PCRE_UTF8) pattern modifier to good use.
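For anyone who hasn't met the /u modifier, here is a rough sketch of the kind of thing it makes possible. The two function names are invented for this example and I'm not claiming this is how the library does it internally; it assumes PHP 7 or later with the standard PCRE extension.

```php
<?php
// Hypothetical helper: with /u, PCRE refuses to match a subject that is
// not valid UTF-8, so an empty pattern doubles as a well-formedness check.
function sketch_utf8_is_valid(string $str): bool
{
    return preg_match('//u', $str) === 1;
}

// Hypothetical helper: split a UTF-8 string into an array of characters.
// /u makes "." match a whole character instead of a single byte,
// /s lets "." match newlines as well.
function sketch_utf8_split(string $str): array
{
    if (preg_match_all('/./su', $str, $matches) === false) {
        return []; // subject was not valid UTF-8
    }
    return $matches[0];
}

$word  = "na\xC3\xAFve"; // "naïve", the third character takes two bytes
$chars = sketch_utf8_split($word);

echo count($chars), "\n";                           // 5
var_dump(sketch_utf8_is_valid($word));              // bool(true)
var_dump(sketch_utf8_is_valid("broken \xC3 byte")); // bool(false)
```

The invalid example is a lead byte (C3) followed by a space rather than a continuation byte, which is exactly the sort of sequence you'd want to catch before storing or displaying user input.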
Anyway – documentation is thin on the ground, apart from inline in the source code – need to do a tutorial eventually. For a “from scratch” lead in, best places to go are the charsets and PHP utf-8 pages of the WACT wiki.
A big warning if you plan to use this: do not “blindly” replace all use of PHP’s string functions with functions from this library. Most of the functions it provides you will only ever need occasionally, if at all, and although performance is “acceptable”, it’s not as fast as the real str* thing. The two key things you must have clear in your mind are that ASCII-7 (US-ASCII) is a subset of UTF-8 and that each valid UTF-8 sequence is unique (it cannot be mistaken for a subsequence of another, longer sequence). It’s worth spending a long time looking at the table on this page.
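Those two properties are why many native functions are already safe on valid UTF-8. The sketch below uses only PHP's built-in functions plus one made-up helper (sketch_byte_to_char_offset) to show where the native versions are fine and where they bite: searching and replacing work on bytes without trouble, but offsets and lengths are in bytes, not characters. It assumes PHP 7 or later.

```php
<?php
// "café au lait" with the e-acute written as its two bytes, C3 A9.
$text = "caf\xC3\xA9 au lait";

// Safe: a valid UTF-8 needle can only match at a character boundary,
// so plain str_replace() does the right thing.
$replaced = str_replace("caf\xC3\xA9", "tea", $text);
echo $replaced, "\n"; // tea au lait

// Also safe for finding things, but the result is a BYTE offset.
$bytePos = strpos($text, "au");
echo $bytePos, "\n"; // 6

// Hypothetical helper: turn a byte offset into a character offset by
// counting the characters that come before it.
function sketch_byte_to_char_offset(string $str, int $byteOffset): int
{
    return (int) preg_match_all('/./su', substr($str, 0, $byteOffset));
}
echo sketch_byte_to_char_offset($text, $bytePos), "\n"; // 5

// Not safe: cutting at an arbitrary byte can split a character in two.
$cut = substr($text, 0, 4);              // "caf" plus a lone C3 byte
var_dump(preg_match('//u', $cut) === 1); // bool(false), no longer valid UTF-8
```

So the UTF-8 aware versions are really only needed where positions, lengths or case matter, which fits the warning above about not swapping everything blindly.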
Anyway – hopefully it helps make mbstring independent PHP applications darn near possible. Failing that, PHP6 may be around by the end of the year.
Looks like I get to be the first to say: congratulations, Harry, outstanding work (as usual).
February 27th, 2006 at 10:43 am
Many thanks although let’s wait and see for the bug reports… Iñtërnâtiônàlizætiøn
February 27th, 2006 at 11:28 am
Nice! BTW. I recently added simple romanization support for a bunch of non-latin languages to the UTF-8 lib in DokuWiki.
February 28th, 2006 at 8:39 am
Will definitely be swiping that (assuming it’s OK). Might also take a shot at porting Text::Unidecode from CPAN.
February 28th, 2006 at 9:25 am
It is an amazing library. Thanks for the wonderful job Harry :)
The Joomla development team has selected this library to provide utf-8 capabilities to the Joomla 1.1 CMS. The PHP UTF-8 library has been integrated into the core framework of Joomla. A wrapper class provides a standard Joomla API for both core and third-party developer (3PD) extensions.
The result is that any PHP developer extending the Joomla framework gets utf-8 included thanks to PHP UTF8.
Even with PHP 6 on the horizon it is great to be able to provide true utf-8 capabilities with backward compatibility to PHP 4.1 and without having to force loading of mbstring on all those shared hosts.
March 1st, 2006 at 7:08 pm
[...] Following on from this release (this turned out to be easier than I thought), ported Text::Unidecode to PHP. Code available here, or track down the utf8_to_ascii package from the main page. Released it separately to keep with the original (Perl artistic) license, while the rest of the stuff is under LGPL. [...]
March 3rd, 2006 at 9:04 pm
[...] That is, unless you’re using PHP. One of the biggest weaknesses of PHP (up to and including PHP 5.1) is that its built-in string functions handle multi-byte character encodings like UTF-8 and UTF-16 incorrectly. PHP was written with the assumption that one byte equals one character, which simply isn’t the case in such encodings. An optional module or library can be used to provide alternative string functions that do support multi-byte characters, but many of the PHP scripts in circulation use the built-in functions, and simply can’t handle Unicode characters as a result. [...]
March 15th, 2006 at 7:08 pm
[...] For PHP developers especially, where limited out-of-the-box support for UTF-8 keeps many sites on single-byte character encodings, this issue could cause nasty surprises indeed. For example, if you want to add the ability to submit a form via AJAX and keep the standard submission method as a fallback, you could potentially end up having to support two different encodings for that submitted data! Tags: JavaScript, AJAX, PHP [...]
May 10th, 2006 at 12:08 pm
Hi Harry,
I was using str_replace to replace some parts of a utf-8 string with an ascii string, and I was getting some ? (question mark) characters. So I tried your utf8_str_replace, but I am still getting the same characters.
Do I have to use the UTF-8 to Code Point Array Converter?
The utf-8 string I am trying to replace is in a PHP variable.
I don’t have the mbstring extension installed and I was hoping to be able to replace strings with your utf8.php functions.
Thanks
October 3rd, 2006 at 12:25 pm