PHP UTF-8 0.1

Been messing around with bits of this code for a long time, in fact since first really getting to grips with Dokuwiki, but finally got the first release out.

PHP UTF-8 is intended to make it possible to handle UTF-8 encoded strings in PHP, without requiring the mbstring extension (although it uses mbstring if it’s available). In short, it provides versions of PHP’s string functions (pretty much everything you’ll find on this list), prefixed with utf8_ and aware of UTF-8 encoding (that is, one character may take more than one byte: 1 character >= 1 byte). It also gives you some tools to help check UTF-8 strings for “well formedness” and strip bad sequences, plus some “ASCII helpers”.
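To make the “1 character >= 1 byte” point concrete, here is a minimal sketch (not the library’s actual code). The helper name sketch_utf8_strlen is made up for illustration, and it assumes the input is already well-formed UTF-8. It counts characters by skipping continuation bytes, which always have the bit pattern 10xxxxxx.

```php
<?php
// Hypothetical helper for illustration only; not part of PHP UTF-8.
// Assumes $str is well-formed UTF-8.
function sketch_utf8_strlen($str)
{
    $count = 0;
    $bytes = strlen($str); // native strlen() counts bytes
    for ($i = 0; $i < $bytes; $i++) {
        // Continuation bytes look like 10xxxxxx; every other byte
        // starts a new character.
        if ((ord($str[$i]) & 0xC0) !== 0x80) {
            $count++;
        }
    }
    return $count;
}

// "naïve" written with hex escapes so the example does not depend on
// how this source file is encoded.
$word = "na\xC3\xAFve";

echo strlen($word), "\n";             // 6 (bytes)
echo sketch_utf8_strlen($word), "\n"; // 5 (characters)
```

For plain ASCII input both numbers are the same, which is why the difference is easy to miss until non-ASCII text turns up.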

Parts of the code are cannibalised from elsewhere – thanks to Andi Gohr (Dokuwiki UTF-8) and Henri Sivonen for his UTF-8 to Code Point Array Converter (which was ported to PHP from the Mozilla codebase).

You’ll have to forgive a little pride but, from initial benchmarks on the most critical native (non-mbstring) functions, performance is almost as good as the mbstring functions – within the acceptable range. The key was this inspired tip. Otherwise it’s a matter of putting the /u (PCRE_UTF8) pattern modifier to good use.
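As a rough illustration of what the /u modifier buys you, here is a small sketch using only PHP’s standard preg_* functions. The function names sketch_is_utf8 and sketch_utf8_chars are hypothetical and are not taken from the library’s source; the library’s own implementations may well differ.

```php
<?php
// Hypothetical helpers for illustration; not taken from the PHP UTF-8 source.

// With /u, PCRE checks that the subject is valid UTF-8 before matching,
// so the empty pattern matches (returns 1) only for well-formed input.
function sketch_is_utf8($str)
{
    return preg_match('//u', $str) === 1;
}

// Splitting on the empty pattern with /u gives one array element per
// character rather than one per byte. Call it only on valid UTF-8.
function sketch_utf8_chars($str)
{
    return preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
}

var_dump(sketch_is_utf8("na\xC3\xAFve"));   // bool(true)
var_dump(sketch_is_utf8("na\xC3ve"));       // bool(false): truncated sequence
print_r(sketch_utf8_chars("na\xC3\xAFve")); // five elements: n, a, ï, v, e
```

The strict comparison with 1 matters: on bad input preg_match() reports failure rather than “no match”, and comparing against 1 treats both cases as “not valid”.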

Anyway – documentation is thin on the ground, apart from inline in the source code – need to do a tutorial eventually. For a “from scratch” lead in, best places to go are the charsets and PHP utf-8 pages of the WACT wiki.

A big warning if you plan to use this – do not “blindly” replace all use of PHP’s string functions with functions from this library. Most of the functions it provides you will only ever need occasionally, if at all, and although performance is “acceptable”, it’s not as fast as the real str* thing. The two key things you must have clear in your mind are that 7-bit ASCII (US-ASCII) is a subset of UTF-8 and that each valid UTF-8 sequence is unique (it cannot be mistaken for a subsequence of another, longer sequence). It’s worth spending a long time looking at the table on this page.
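Those two properties are the reason many of PHP’s native, byte-oriented functions are already safe on UTF-8 strings. The sketch below uses only built-in functions and assumes that both the text and the search strings are valid UTF-8; the sample text is made up.

```php
<?php
// "café au lait, café noir", written with hex escapes so the example
// does not depend on the encoding of this source file.
$text = "caf\xC3\xA9 au lait, caf\xC3\xA9 noir";

// Byte-wise replacement is safe: the bytes of one valid UTF-8 character
// can never match in the middle of another character.
$replaced = str_replace("caf\xC3\xA9", "tea", $text);
echo $replaced, "\n"; // tea au lait, tea noir

// Splitting on an ASCII delimiter is safe too, since ASCII bytes never
// occur inside a multi-byte sequence.
$parts = explode(", ", $text);
echo count($parts), "\n"; // 2

// Searching finds the right place, but the offset is counted in BYTES:
// "au" is the 6th character (index 5), yet strpos() reports 6.
$pos = strpos($text, "au");
echo $pos, "\n"; // 6
```

It’s only when you care about lengths and positions in characters, or about things like case conversion, that a UTF-8 aware function is really needed.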

Anyway – hopefully it helps make mbstring independent PHP applications darn near possible. Failing that, PHP6 may be around by the end of the year.

  • ajking

    Looks like I get to be the first to say: congratulations, Harry, outstanding work (as usual).

  • http://www.phppatterns.com HarryF

    Many thanks although let’s wait and see for the bug reports… Iñtërnâtiônàlizætiøn

  • Andi

    Nice! BTW. I recently added simple romanization support for a bunch of non-latin languages to the UTF-8 lib in DokuWiki.

  • http://www.phppatterns.com HarryF

    “I recently added simple romanization support for a bunch of non-latin languages to the UTF-8 lib in DokuWiki.”

    Will definitely be swiping that (assuming it’s OK). Might also take a shot at porting Text::Unidecode from CPAN.

  • davidgal

    It is an amazing library. Thanks for the wonderful job Harry :)

    The Joomla development team has selected this library to provide utf-8 capabilities to the Joomla 1.1 CMS. The PHP UTF-8 library has been integrated into the core framework of Joomla. A wrapper class provides a standard Joomla API for both core and 3PD extensions.

    The result is that any PHP developer extending the Joomla framework gets utf-8 included thanks to PHP UTF8.

    Even with PHP 6 on the horizon it is great to be able to provide true utf-8 capabilities with backward compatibility to PHP 4.1 and without having to force loading of mbstring on all those shared hosts.

  • dannolinux

    Hi Harry,

    I was using str_replace to replace some parts of a utf-8 string with an ascii string, and I was getting some ? (question mark) characters. So I tried your utf8_str_replace, but I am still getting the same characters.
    Do I have to use the UTF-8 to Code Point Array Converter?

    The utf-8 string I am trying to replace is in a PHP variable.

    I don’t have the mbstring extension installed and I was hoping to be able to replace strings with your utf8.php functions.

    Thanks