Bringing Unicode to PHP with Portable UTF-8

PHP allows multibyte variable names like $a∩b, $Ʃxy and $Δx, mbstring and other extensions work with Unicode strings, and the utf8_encode() and utf8_decode() functions translate strings between the UTF-8 and ISO-8859-1 encodings. Yet it’s widely acknowledged that PHP lacks Unicode support.

This article covers what the lack of Unicode support means, and demonstrates the use of a library that brings Unicode support to your PHP application, Portable UTF-8.

Unicode Support in PHP

PHP’s lack of Unicode/multibyte support means that the standard string handling functions treat strings as a sequence of single-byte characters. In fact, the official manual defines a string in PHP as a “series of characters, where a character is the same as a byte.” PHP supports only 8-bit characters, while Unicode (and many other character sets) may require more than one byte to represent a character. This limitation of PHP affects almost all aspects of string manipulation, including (but not limited to) substring extraction, determining string lengths, string splitting, and shuffling.
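The byte-versus-character mismatch described above can be seen with a few built-in functions. This is a minimal sketch that assumes the mbstring extension is enabled; the variable names are only illustrative.

```php
<?php
// "é" encoded as UTF-8 occupies two bytes: 0xC3 0xA9.
$word = "caf\xC3\xA9"; // "café"

// strlen() counts bytes, so the accented letter is counted twice.
$byteLength = strlen($word);

// mb_strlen() counts characters when it is told the encoding.
$charLength = mb_strlen($word, 'UTF-8');

// strrev() reverses bytes, which swaps the two bytes of "é" and
// leaves a string that is no longer valid UTF-8.
$reversed = strrev($word);
$stillValid = mb_check_encoding($reversed, 'UTF-8');

printf("bytes: %d, characters: %d\n", $byteLength, $charLength);
```

The same kind of damage can happen with any byte-oriented function that cuts or reorders a string in the middle of a multibyte sequence.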

Efforts to solve the problem started in early 2005, but the work on bringing native Unicode support to PHP was shelved in 2010 for several reasons. Since native Unicode support in PHP may take years to arrive, if it arrives at all, developers must rely on extensions like mbstring and iconv, which fill the gap but provide only limited Unicode support. These extensions are not Unicode-centric and can also translate between non-Unicode encodings. Still, they make working with Unicode strings easier.

But the aforementioned extensions also have their shortcomings. They provide only limited functionality for Unicode string handling, and their availability cannot be taken for granted: mbstring is not enabled by default, and iconv, although enabled by default since PHP 5, can be left out when PHP is built. The server administrator must make sure the extensions are enabled before PHP applications can use them. Shared hosting providers often make the situation worse by installing only one or two of the extensions, making it difficult for developers to rely on a consistently available API for their Unicode needs.

Despite all of this, the good thing is that PHP can output Unicode text. This is because PHP doesn’t really care whether we are sending out English text encoded in ASCII or some other text that belongs to a language whose characters are encoded in multiple bytes. Knowing this, what PHP developers now need is only an API that provides comfortable Unicode-based string manipulation.

Portable UTF-8

A recent solution is the creation of user-space libraries written in PHP. These libraries can be easily bundled with an application to ensure the presence of Unicode support, even if support at the server/language level is missing. Many open-source applications already include their own such libraries and many more use freely available third-party libraries; one such library is Portable UTF-8.

Portable UTF-8 is a free, lightweight library built over mbstring and iconv. It extends the capabilities of the two extensions to provide about 60 functions for Unicode-based string manipulation, testing, and validation; it offers UTF-8 aware counterparts for almost all of PHP’s common string-handling functions. As its name suggests, Portable UTF-8 uses UTF-8 as its primary character encoding scheme.

The library uses the available extensions (mbstring and iconv) for speed reasons, and smooths over some of the inconsistencies of working with them directly, but falls back to UTF-8 routines written in pure PHP if the extensions aren’t available on the server. Portable UTF-8 is fully portable and works with any installation of PHP version 4.2 or higher.

String Handling with Portable UTF-8

A text editor with poor Unicode support may corrupt text when reading it, and text copied from such an editor and pasted into a web form might be a source of invalid UTF-8 in your application. When dealing with user-submitted input, it’s important that we make sure the input is exactly what the application expects. To detect whether the text is valid UTF-8, we can use the library’s is_utf8() function.

if (is_utf8($_POST['title'])) {
    // do something...
}
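For readers curious how such a check can work without the library, here is a rough sketch that relies only on PCRE. The function name sketch_is_valid_utf8() is hypothetical, and this is not necessarily how is_utf8() is implemented; the type declarations assume PHP 7 or later.

```php
<?php
// Hypothetical helper, not part of Portable UTF-8.
function sketch_is_valid_utf8(string $text): bool
{
    // With the /u modifier, PCRE validates the subject as UTF-8 before
    // matching. preg_match() returns false for malformed input and 1
    // when the empty pattern matches.
    return preg_match('//u', $text) === 1;
}

// "\xE9" is "é" in ISO-8859-1, but on its own it is not valid UTF-8.
var_dump(sketch_is_valid_utf8("caf\xC3\xA9"));
var_dump(sketch_is_valid_utf8("caf\xE9"));
```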

Recovering characters from invalid bytes is an impossible exercise, so stripping out the bytes that cannot be recognized as valid UTF-8 characters might be your only option. We can strip invalid bytes with the utf8_clean() function.

$title = utf8_clean($_POST['title']);
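One way to approximate this kind of cleaning with mbstring alone is sketched below. It assumes the mbstring extension is enabled; sketch_strip_invalid_utf8() is a hypothetical name, and the library's own utf8_clean() may work differently.

```php
<?php
// Hypothetical helper, not part of Portable UTF-8.
function sketch_strip_invalid_utf8(string $text): string
{
    // Remember the current substitute character so it can be restored.
    $previous = mb_substitute_character();

    // 'none' tells mbstring to drop bytes it cannot convert instead of
    // replacing them with "?".
    mb_substitute_character('none');

    // Converting from UTF-8 to UTF-8 passes valid sequences through
    // and discards the invalid ones.
    $clean = mb_convert_encoding($text, 'UTF-8', 'UTF-8');

    mb_substitute_character($previous);

    return $clean;
}

echo sketch_strip_invalid_utf8("ab\xFFcd"), "\n";
```

Silently dropping bytes changes the user's input, so it may be worth logging when the cleaned string differs from the original.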

Each Unicode character can be encoded as a corresponding HTML entity, and you may want to encode text this way to help prevent XSS attacks before outputting it to the browser.

echo utf8_html_encode($title);
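As an illustration of what entity encoding involves, the sketch below combines two built-in functions: htmlspecialchars() for the characters that are significant in markup, and mb_encode_numericentity() for everything outside ASCII. The function name sketch_html_encode() is hypothetical, mbstring is assumed to be enabled, and the output of utf8_html_encode() itself may differ.

```php
<?php
// Hypothetical helper, not part of Portable UTF-8.
function sketch_html_encode(string $text): string
{
    // Escape <, >, &, and both quote characters.
    $escaped = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

    // Map every code point from U+0080 upward to a numeric entity.
    // The conversion map format is [start, end, offset, mask].
    $convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];

    return mb_encode_numericentity($escaped, $convmap, 'UTF-8');
}

echo sketch_html_encode("<\xC3\xA9>"), "\n";
```

It is the htmlspecialchars() step that neutralises markup. Turning non-ASCII characters into entities mainly matters when the page cannot be served as UTF-8.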

It’s common to trim whitespace at the start and the end of a string. Unicode lists about 20 whitespace characters, and there are some ASCII-based control characters that should be considered as well for such trimming.

$title = utf8_trim($title);
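PHP's built-in trim() only knows a handful of ASCII characters, so a Unicode-aware trim has to look at character properties instead. The sketch below uses PCRE for that; sketch_unicode_trim() is a hypothetical name, and the exact set of characters removed by utf8_trim() may differ.

```php
<?php
// Hypothetical helper, not part of Portable UTF-8.
function sketch_unicode_trim(string $text): string
{
    // \p{Z} covers Unicode separators (for example NO-BREAK SPACE and
    // IDEOGRAPHIC SPACE); \p{Cc} covers control characters such as
    // tab and newline.
    $trimmed = preg_replace('/^[\p{Z}\p{Cc}]+|[\p{Z}\p{Cc}]+$/u', '', $text);

    // preg_replace() returns null when the subject is not valid UTF-8;
    // in that case the input is returned unchanged.
    return $trimmed ?? $text;
}

// NO-BREAK SPACE and a tab in front, IDEOGRAPHIC SPACE and a newline behind.
echo '[', sketch_unicode_trim("\xC2\xA0\t Title \xE3\x80\x80\n"), "]\n";
```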

On the other hand, there may be runs of duplicate whitespace characters in the middle of the string that should be removed. The following shows how utf8_remove_duplicates() and utf8_ws() can be used together:

$title = utf8_remove_duplicates($title, utf8_ws());
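A related and slightly different technique is to normalise every run of whitespace to a single ASCII space. The sketch below does that with PCRE; sketch_collapse_whitespace() is a hypothetical name and it is not a reimplementation of utf8_remove_duplicates().

```php
<?php
// Hypothetical helper, not part of Portable UTF-8.
function sketch_collapse_whitespace(string $text): string
{
    // \p{Z} matches Unicode separators and \s matches ASCII whitespace;
    // each run of either is replaced by one ASCII space.
    $collapsed = preg_replace('/[\p{Z}\s]+/u', ' ', $text);

    // Fall back to the input if it was not valid UTF-8.
    return $collapsed ?? $text;
}

// Space + NO-BREAK SPACE + IDEOGRAPHIC SPACE, then two tabs.
echo sketch_collapse_whitespace("a \xC2\xA0\xE3\x80\x80b\t\tc"), "\n";
```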

Traditional solutions for creating URL slugs for SEO reasons use transliteration and strip all non-ASCII characters from the slug. That makes a URL less valuable than it could be. Since URLs can carry UTF-8 encoded characters, there is no need for such stripping or transliteration, and we can create rich slugs containing characters of any language:

$slug = utf8_url_slug($title, 30); // char length 30
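To make the idea concrete, here is a rough sketch of a slug builder that keeps letters and digits of any script. It assumes mbstring is enabled; sketch_unicode_slug() is a hypothetical name, and utf8_url_slug() may follow different rules.

```php
<?php
// Hypothetical helper, not part of Portable UTF-8.
function sketch_unicode_slug(string $title, int $maxLength): string
{
    $slug = mb_strtolower($title, 'UTF-8');

    // Keep letters, combining marks and digits of any script;
    // turn every other run of characters into a single hyphen.
    $slug = preg_replace('/[^\p{L}\p{M}\p{N}]+/u', '-', $slug) ?? '';

    // Limit the length in characters, not bytes.
    $slug = mb_substr($slug, 0, $maxLength, 'UTF-8');

    // The hyphen is a single ASCII byte, so byte-wise trim() is safe here.
    return trim($slug, '-');
}

echo sketch_unicode_slug("Caf\xC3\xA9 del Mar!", 30), "\n";
```

When the slug is placed in a link, rawurlencode() can be used to percent-encode the non-ASCII bytes.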

From input validation through to saving the data in a database, a Unicode-aware application focuses on characters and character length rather than bytes and byte length. This shift of focus necessitates a new interface that understands the difference. It’s common to enforce a limit on input character length, so here we create a substring if the input exceeds 60 characters.

if (utf8_strlen($title) > 60) {
    $title = utf8_substr($title, 0, 60);
}

Or alternatively:

if (!utf8_fits_inside($title, 60)) {
    $title = utf8_substr($title, 0, 60);
}
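The reason a character-aware substring matters can be shown with the built-in functions alone. The sketch below assumes mbstring is enabled, and sketch_limit_chars() is a hypothetical helper rather than a library function.

```php
<?php
// Hypothetical helper, not part of Portable UTF-8.
function sketch_limit_chars(string $text, int $maxChars): string
{
    // Compare and cut by character count, not byte count.
    if (mb_strlen($text, 'UTF-8') <= $maxChars) {
        return $text;
    }

    return mb_substr($text, 0, $maxChars, 'UTF-8');
}

$title = "na\xC3\xAFve"; // "naïve": 5 characters, 6 bytes

// substr() counts bytes, so it cuts the two-byte "ï" in half.
$byByte = substr($title, 0, 3);

// The character-aware version keeps "ï" whole.
$byChar = sketch_limit_chars($title, 3);

var_dump(mb_check_encoding($byByte, 'UTF-8'), mb_check_encoding($byChar, 'UTF-8'));
```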

There are three different ways to access individual characters with the Portable UTF-8 library. We can use utf8_access() to reach an individual character.

echo 'The sixth character is: ' . utf8_access($string, 5);

utf8_chr_map() allows individual character access iteratively using a callback function.

utf8_chr_map('some_callback', $string);

And we can split a string into an array of characters using utf8_split() and work with the array elements as individual characters.

array_map('some_callback', utf8_split($string));
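Splitting into characters can also be sketched with PCRE alone, which helps show what such a function has to do. sketch_split_chars() is a hypothetical name, and the result is a list of code points, so a base letter and a following combining mark end up in separate elements.

```php
<?php
// Hypothetical helper, not part of Portable UTF-8.
function sketch_split_chars(string $text): array
{
    // With /u, the empty pattern matches at every code point boundary;
    // PREG_SPLIT_NO_EMPTY drops the empty pieces at both ends.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split() returns false when the subject is not valid UTF-8.
    return $chars === false ? [] : $chars;
}

$chars = sketch_split_chars("\xCE\x94x=1"); // "Δx=1"

// Array indexes are zero-based, so index 1 is the second character.
$second = $chars[1];

echo count($chars), " characters, second is: ", $second, "\n";
```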

Working with Unicode may also require finding the minimum or maximum code point in a string, splitting strings, handling the byte order mark, converting case, randomizing or shuffling, replacing, and so on. All of that is supported by Portable UTF-8.

Conclusion

Development of PHP 6 has been stopped, delaying the much-needed native Unicode support for developing multilingual applications. In the meantime, server-side extensions and user-space libraries like Portable UTF-8 play an important role in helping developers build a better standardized web that meets local needs.

  • Lars Gunther

    This is a great tip and I might very well end up using this library.

    I have a few small things I’d like to point out, though.

    1. Reading data using the superglobals really should be deprecated in favour of the input-filters. Consistently using only the way that allows for the best security will prove worthwhile in the long run, IMO. It seems many of the functions in this library can be used as callbacks.

    2. As a non US-ASCII mother language person I’d recommend against using utf8_html_encode on output. It absolutely destroys both view-source and possibly page size. Use htmlpurifier or a similar library to clean all HTML instead.

    3. utf8_url_slug seems like a nice function. I am definitely going to check it out. However, many users also need to be able to type a URL. If the site in question will attract users from other countries/languages than your own, some pesky characters might not be so easy to type. For that reason I still transliterate å => a, ä => a and ö => o on my Swedish sites’ URLs.

  • Hamid Sarfraz

    @lars
    An API has to be complete, whether or not certain applications require/use a feature or not.

    Thanks for your thoughts.

  • Joseph Scott

    Unfortunately Portable UTF-8 isn’t open source.

  • Dave

    @Lars Regarding your third point, I could understand having your main url and any main landing pages e.g. main page of a blog in ASCII. However, I wouldn’t of thought users would be typing urls to other pages, especially users that don’t use the language your content is written in?

    The main reason for using ASCII only urls (in my experience) is that search engines can have problems with non-ASCII characters in URLs (particularly those encoded as per RFC 3987).

  • chris

    No composer :(

  • Anonymous

    Question: How easy would it be to modify this library such that if it were installed on a system, then use of strlen(), substr() etc were automagically mapped to their utf8_xxx() equivalents, possibly via a call to ini_set()?

    This basically means dev’s wouldn’t have to lookup the lib’s own function names, they could use native PHP functions, calls to which would first be run through the library, to see if a function mapping existed.

    I can see that enabling this out-of-the-box would be a little dangerous for legacy applications, but via an ini_set() call to explicitly enable/disable it either in logic or server config, would help here.

  • Anonymous

    @harikt
    Patchwork UTF-8 is a good library but it is for PHP 5.3 or above, while you will appreciate that Portable UTF-8 can run on far older versions too.

    • Anonymous

      I am a good fan of PHP 5.3+ libraries :-) . I love to move with the newest technologies.

  • Ghulam Murtza

    it seems much helpful