PHP UTF-8 0.1

by Harry Fuecks

Been messing around with bits of this code for a long time, in fact since first really getting to grips with Dokuwiki, but finally got the first release out.

PHP UTF-8 is intended to make it possible to handle UTF-8 encoded strings in PHP, without requiring the mbstring extension (although it uses mbstring if it’s available). In short, it provides versions of PHP’s string functions (pretty much everything you’ll find on this list), prefixed with utf8_ and aware of UTF-8 encoding (that is, one character can take more than one byte: 1 character >= 1 byte). It also gives you some tools to check UTF-8 strings for “well formedness” and strip bad sequences, plus some “ASCII helpers”.
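To make the “1 character >= 1 byte” point concrete, here is a minimal sketch of the difference between counting bytes and counting characters. The helper name sketch_utf8_strlen is made up for illustration and is not the library’s code; it assumes the input is already valid UTF-8 and simply skips continuation bytes.

```php
<?php
// "Iñtërnâtiônàlizætiøn" written with explicit UTF-8 byte escapes, so the
// example does not depend on the encoding of this source file.
$str = "I\xC3\xB1t\xC3\xABrn\xC3\xA2ti\xC3\xB4n\xC3\xA0liz\xC3\xA6ti\xC3\xB8n";

// Native strlen() counts bytes, not characters.
$bytes = strlen($str);

// Hypothetical helper (an assumption, not part of PHP UTF-8): count every
// byte that is NOT a UTF-8 continuation byte (bit pattern 10xxxxxx).
// Assumes $s is well-formed UTF-8.
function sketch_utf8_strlen($s)
{
    $count = 0;
    $len = strlen($s);
    for ($i = 0; $i < $len; $i++) {
        if ((ord($s[$i]) & 0xC0) !== 0x80) {
            $count++;
        }
    }
    return $count;
}

echo $bytes, "\n";                   // 27 (bytes)
echo sketch_utf8_strlen($str), "\n"; // 20 (characters)
```

The seven accented characters take two bytes each, which is where the 7 extra bytes come from.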

Parts of the code are cannibalised from elsewhere - thanks to Andi Gohr (Dokuwiki UTF-8) and Henri Sivonen for his UTF-8 to Code Point Array Converter (which was ported to PHP from the Mozilla codebase).

You’ll have to forgive a little pride but, from initial benchmarks on the most critical native (non-mbstring) functions, performance is almost as good as the mbstring functions - within the acceptable range. The key was this inspired tip. Otherwise it’s a matter of putting the /u (PCRE_UTF8) pattern modifier to good use.
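For anyone who hasn’t met the /u modifier, a rough sketch of what it buys you, using only PHP’s built-in preg_* functions. This is an illustration under my own assumptions, not how the library is actually implemented.

```php
<?php
// "Iñtërn" as explicit UTF-8 bytes: 6 characters, 8 bytes.
$str = "I\xC3\xB1t\xC3\xABrn";

// Without /u, "." matches a single byte; with /u it matches a whole
// UTF-8 character.
$byteMatches = preg_match_all('/./s', $str, $m1);
$charMatches = preg_match_all('/./su', $str, $m2);

// Splitting on the empty pattern with /u yields one array element per
// character.
$chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);

// With /u, PCRE rejects malformed UTF-8: preg_match() returns false
// instead of 0 or 1, which gives a cheap well-formedness check.
$bad = "abc\xC3"; // truncated two-byte sequence
$goodIsValid = (preg_match('//u', $str) === 1);
$badIsValid  = (preg_match('//u', $bad) === 1);

echo $byteMatches, "\n";  // 8
echo $charMatches, "\n";  // 6
echo count($chars), "\n"; // 6
var_dump($goodIsValid);   // bool(true)
var_dump($badIsValid);    // bool(false)
```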

Anyway - documentation is thin on the ground, apart from inline in the source code - need to do a tutorial eventually. For a “from scratch” lead in, best places to go are the charsets and PHP utf-8 pages of the WACT wiki.

A big warning if you plan to use this - do not “blindly” replace all use of PHP’s string functions with functions from this library - most of the functions it provides you will only ever need occasionally, if at all, and although performance is “acceptable”, it’s not as fast as the real str* thing. The two key things you must have clear in your mind are that 7-bit ASCII (US-ASCII) is a subset of UTF-8 and that each valid UTF-8 sequence is unique (cannot be mistaken for a subsequence of another, longer sequence). It’s worth spending a long time looking at the table on this page.
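A small sketch of what those two properties mean in practice, using only native functions. Searching and replacing whole, valid UTF-8 substrings works byte for byte; what goes wrong is anything that counts or cuts by position. The sample string is just an illustration.

```php
<?php
// "naïve café" as explicit UTF-8 bytes.
$str = "na\xC3\xAFve caf\xC3\xA9";

// Safe: a valid UTF-8 needle can only match at a character boundary,
// so the byte-oriented str_replace() does the right thing.
$out = str_replace("caf\xC3\xA9", "coffee", $str); // "naïve coffee"

// Safe: ASCII bytes never appear inside a multi-byte sequence, so
// splitting on an ASCII delimiter cannot cut a character in half.
$parts = explode(' ', $str); // ["naïve", "café"]

// Careful: positions are byte offsets, not character offsets.
$pos = strpos($str, 'v'); // 4, although "v" is the 4th character (index 3)

// Unsafe: cutting at an arbitrary byte offset can split a character.
$cut = substr($str, 0, 3); // "na" plus the first byte of "ï"
$cutIsValid = (preg_match('//u', $cut) === 1); // false: malformed UTF-8

echo $out, "\n";
echo $pos, "\n"; // 4
var_dump($cutIsValid); // bool(false)
```

So the native functions are fine for whole-string search, replace and split on valid UTF-8; the UTF-8 aware versions are for the cases where offsets and lengths in characters matter.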

Anyway - hopefully it helps make mbstring independent PHP applications darn near possible. Failing that, PHP6 may be around by the end of the year.

This post has 9 responses so far

  1. Looks like I get to be the first to say: congratulations, Harry, outstanding work (as usual).

     
  2. Many thanks although let’s wait and see for the bug reports… Iñtërnâtiônàlizætiøn

     
  3. Nice! BTW. I recently added simple romanization support for a bunch of non-latin languages to the UTF-8 lib in DokuWiki.

     
  4. “I recently added simple romanization support for a bunch of non-latin languages to the UTF-8 lib in DokuWiki.”

    Will definitely be swiping that (assuming it’s OK). Might also take a shot at porting Text::Unidecode from CPAN.

     
  5. It is an amazing library. Thanks for the wonderful job Harry :)

    The Joomla development team has selected this library to provide UTF-8 capabilities to the Joomla 1.1 CMS. The PHP-UTF library has been integrated into the core framework of Joomla. A wrapper class provides a standard Joomla API for both core and 3PD extensions.

    The result is that any PHP developer extending the Joomla framework gets utf-8 included thanks to PHP UTF8.

    Even with PHP 6 on the horizon it is great to be able to provide true utf-8 capabilities with backward compatibility to PHP 4.1 and without having to force loading of mbstring on all those shared hosts.

     
  6. […] Following on from this release, this turned out to be easier than I thought: ported Text::Unidecode to PHP. Code is available here, or track down the utf8_to_ascii package from the main page. Released it separately to keep with the original (Perl Artistic) license, while the rest of the stuff is under LGPL. […]

     
  7. […] That is, unless you’re using PHP. One of the biggest weaknesses of PHP (up to and including PHP 5.1) is that its built-in string functions handle multi-byte character encodings like UTF-8 and UTF-16 incorrectly. PHP was written with the assumption that one byte equals one character, which simply isn’t the case in such encodings. An optional module or library can be used to provide alternative string functions that do support multi-byte characters, but many of the PHP scripts in circulation use the built-in functions, and simply can’t handle Unicode characters as a result. […]

     
  8. […] For PHP developers especially, where limited out-of-the-box support for UTF-8 keeps many sites on single-byte character encodings, this issue could cause nasty surprises indeed. For example, if you want to add the ability to submit a form via AJAX and keep the standard submission method as a fallback, you could potentially end up having to support two different encodings for that submitted data! Tags: JavaScript, AJAX, PHP […]

     
  9. Hi Harry,

    I was using str_replace to replace some parts of a UTF-8 string with an ASCII string, and I was getting some ? (question mark) characters. So I tried your utf8_str_replace, but I am still getting the same characters.
    Do I have to use the UTF-8 to Code Point Array Converter?

    The utf-8 string I am trying to replace is in a PHP variable.

    I don’t have the mbstring extension installed and I was hoping to be able to replace strings with your utf8.php functions.

    Thanks

     
