Bringing Unicode to PHP with Portable UTF-8
PHP allows multibyte variable names like $a∩b, $Ʃxy and $Δx, mbstring and other extensions work with Unicode strings, and the utf8_encode() and utf8_decode() functions translate strings between the UTF-8 and ISO-8859-1 encodings. Yet it’s widely acknowledged that PHP lacks Unicode support.
This article covers what the lack of Unicode support means, and demonstrates Portable UTF-8, a library that brings Unicode support to your PHP application.
Unicode Support in PHP
PHP’s lack of Unicode/multibyte support means that the standard string handling functions treat strings as a sequence of single-byte characters. In fact, the official manual defines a string in PHP as a “series of characters, where a character is the same as a byte.” PHP supports only 8-bit characters, while Unicode (and many other character sets) may require more than one byte to represent a character. This limitation affects almost all aspects of string manipulation, including (but not limited to) extracting substrings, determining string lengths, and splitting and shuffling strings.
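To make the byte-versus-character problem concrete, here is a minimal sketch that uses only PHP's built-in byte-oriented functions. The sample string is an assumption chosen for illustration: "Δx", written as explicit UTF-8 bytes so the result does not depend on the encoding of the source file.

```php
<?php
// "Δx" as explicit UTF-8 bytes: Δ (U+0394) is 0xCE 0x94, x is 0x78.
$text = "\xCE\x94x";

$byteLength = strlen($text);        // counts bytes, not characters
$hexBytes   = bin2hex($text);       // the raw bytes behind the string
$firstByte  = substr($text, 0, 1);  // cuts the two-byte Δ in half
$pieces     = str_split($text);     // one array element per byte

echo $byteLength, "\n";          // 3, although the string has 2 characters
echo $hexBytes, "\n";            // ce9478
echo bin2hex($firstByte), "\n";  // ce, a lone lead byte and not valid UTF-8
echo count($pieces), "\n";       // 3
```

Every function here behaves as documented; the results are simply expressed in bytes, which is why a two-character string reports a length of three.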
Efforts to solve the problem started in early 2005, but the work on bringing native Unicode support to PHP was shelved in 2010 for several reasons. Since native Unicode support may take years to arrive, if it arrives at all, developers must rely on extensions such as mbstring and iconv, which fill the gap but provide only limited Unicode support. These extensions are not Unicode-centric and can also translate between non-Unicode encodings. Even so, they make working with Unicode strings easier.
But the aforementioned extensions also have their shortcomings. They provide only limited functionality for Unicode string handling, and they are not guaranteed to be present: mbstring, for example, is a non-default extension, and iconv can be left out when PHP is compiled. The server administrator must explicitly enable whichever extensions are missing before PHP applications can use them. Shared hosting providers often make the situation worse by installing only one or two of the extensions, making it difficult for developers to rely on a consistently available API for their Unicode needs.
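An application can find out at runtime which of these extensions a given server actually provides. The following is a small sketch built on PHP's extension_loaded(); the helper name sketch_unicode_extensions() is hypothetical and only serves this example.

```php
<?php
// Hypothetical helper: report which optional Unicode-related
// extensions are loaded in the running PHP build.
function sketch_unicode_extensions(): array
{
    $report = [];
    foreach (['mbstring', 'iconv'] as $name) {
        $report[$name] = extension_loaded($name);
    }
    return $report;
}

foreach (sketch_unicode_extensions() as $name => $loaded) {
    echo $name, ': ', $loaded ? 'available' : 'missing', "\n";
}
```

The output differs from server to server, which is exactly the inconsistency described above.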
Despite all of this, the good thing is that PHP can output Unicode text. This is because PHP doesn’t really care whether we are sending out English text encoded in ASCII or some other text that belongs to a language whose characters are encoded in multiple bytes. Knowing this, what PHP developers now need is only an API that provides comfortable Unicode-based string manipulation.
Portable UTF-8
A recent solution is the creation of user-space libraries written in PHP. These libraries can be easily bundled with an application to ensure the presence of Unicode support, even if support at the server/language level is missing. Many open-source applications already include their own such libraries and many more use freely available third-party libraries; one such library is Portable UTF-8.
Portable UTF-8 is a free, lightweight library built over mbstring and iconv. It extends the capabilities of the two extensions to provide about 60 functions for Unicode-based string manipulation, testing, and validation; it offers UTF-8 aware counterparts for almost all of PHP’s common string-handling functions. As its name suggests, Portable UTF-8 uses UTF-8 as its primary character encoding scheme.
The library uses the available extensions (mbstring and iconv) for speed, and smooths over some of the inconsistencies of working with them directly, but falls back to UTF-8 routines written in pure PHP if the extensions aren’t available on the server. Portable UTF-8 is fully portable and works with any installation of PHP version 4.2 or higher.
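The "use the extension if present, otherwise fall back to pure PHP" approach can be sketched as follows. This is not the library's code: the function names are hypothetical, the sketch assumes PHP 7 or later for the type declarations, and the fallback assumes the input is already valid UTF-8.

```php
<?php
// Hypothetical pure-PHP fallback: count every byte that is not a
// UTF-8 continuation byte (bit pattern 10xxxxxx). For valid UTF-8,
// that equals the number of characters.
function sketch_strlen_fallback(string $str): int
{
    $count = 0;
    $bytes = strlen($str);
    for ($i = 0; $i < $bytes; $i++) {
        if ((ord($str[$i]) & 0xC0) !== 0x80) {
            $count++;
        }
    }
    return $count;
}

// Hypothetical wrapper: prefer mbstring for speed, otherwise fall back.
function sketch_strlen(string $str): int
{
    if (function_exists('mb_strlen')) {
        return mb_strlen($str, 'UTF-8');
    }
    return sketch_strlen_fallback($str);
}

echo sketch_strlen("\xCE\x94x"), "\n"; // "Δx" has 2 characters
```

Callers only ever see one function, so the application code stays the same whether or not mbstring is installed.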
String Handling with Portable UTF-8
A text editor with bad Unicode support may corrupt text when reading it, and text copied from such an editor and posted into a web form might be a source of invalid UTF-8 in your application. When dealing with user-submitted input, it’s important to make sure the input is exactly what the application expects. To detect whether the text is valid UTF-8, we can use the library’s is_utf8() function.
if (is_utf8($_POST['title'])) {
// do something...
}
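For readers curious what such a validity check involves, one common idiom that needs no extra extension relies on PCRE's u modifier, which makes preg_match() reject subjects that are not well-formed UTF-8. The function below is a hypothetical sketch of that idiom, not the library's implementation.

```php
<?php
// Hypothetical helper: with the /u modifier, PCRE verifies that the
// subject is well-formed UTF-8 before matching. The empty pattern
// matches any valid string, and preg_match() returns false when the
// UTF-8 check fails.
function sketch_is_valid_utf8(string $str): bool
{
    return preg_match('//u', $str) === 1;
}

var_dump(sketch_is_valid_utf8("\xCE\x94x")); // bool(true): valid "Δx"
var_dump(sketch_is_valid_utf8("\xC3\x28"));  // bool(false): broken two-byte sequence
```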
Recovering the original characters from invalid bytes is generally impossible, so stripping out the bytes that cannot be recognized as valid UTF-8 characters might be your only option. We can strip invalid bytes with the utf8_clean() function.
$title = utf8_clean($_POST['title']);
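One way to strip invalid bytes by hand is to keep only the byte sequences that match the well-formed UTF-8 patterns and discard everything else. The sketch below does this with a byte-level regular expression (no u modifier); the function name is hypothetical and the library may well take a different approach.

```php
<?php
// Hypothetical helper: collect every well-formed UTF-8 sequence and
// drop the bytes in between.
function sketch_strip_invalid_utf8(string $str): string
{
    $sequences = [
        '[\x00-\x7F]',                        // ASCII
        '[\xC2-\xDF][\x80-\xBF]',             // two-byte sequences
        '\xE0[\xA0-\xBF][\x80-\xBF]',         // three-byte, no overlong forms
        '[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}',  // three-byte, general case
        '\xED[\x80-\x9F][\x80-\xBF]',         // three-byte, no surrogates
        '\xF0[\x90-\xBF][\x80-\xBF]{2}',      // four-byte, no overlong forms
        '[\xF1-\xF3][\x80-\xBF]{3}',          // four-byte, general case
        '\xF4[\x80-\x8F][\x80-\xBF]{2}',      // four-byte, up to U+10FFFF
    ];
    preg_match_all('/' . implode('|', $sequences) . '/', $str, $matches);
    return implode('', $matches[0]);
}

// The stray 0xC3 and 0xFF bytes are removed; "Δ" (0xCE 0x94) survives.
echo bin2hex(sketch_strip_invalid_utf8("a\xC3\x28b\xFF\xCE\x94")), "\n"; // 612862ce94
```

Note that the bytes are dropped silently, so the cleaned string can be shorter than what the user submitted.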
Each Unicode character can be encoded as a corresponding HTML entity, and you may want to encode text this way to help prevent XSS attacks before outputting it to the browser.
echo utf8_html_encode($title);
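As a rough illustration of entity encoding, the sketch below escapes the HTML special characters with htmlspecialchars() and then rewrites every non-ASCII character as a decimal numeric entity. Both function names are hypothetical, valid UTF-8 input is assumed, and the exact output of the library's own function may differ.

```php
<?php
// Hypothetical helper: decode one valid UTF-8 character to its code point.
function sketch_codepoint(string $char): int
{
    $b = array_values(unpack('C*', $char));
    switch (count($b)) {
        case 1:
            return $b[0];
        case 2:
            return (($b[0] & 0x1F) << 6) | ($b[1] & 0x3F);
        case 3:
            return (($b[0] & 0x0F) << 12) | (($b[1] & 0x3F) << 6) | ($b[2] & 0x3F);
        default:
            return (($b[0] & 0x07) << 18) | (($b[1] & 0x3F) << 12)
                 | (($b[2] & 0x3F) << 6) | ($b[3] & 0x3F);
    }
}

// Hypothetical helper: escape markup, then turn non-ASCII characters
// into numeric entities such as &#916;.
function sketch_html_encode(string $str): string
{
    $escaped = htmlspecialchars($str, ENT_QUOTES, 'UTF-8');
    $encoded = preg_replace_callback(
        '/[^\x00-\x7F]/u',
        function (array $m): string {
            return '&#' . sketch_codepoint($m[0]) . ';';
        },
        $escaped
    );
    return $encoded ?? '';
}

echo sketch_html_encode("<\xCE\x94>"), "\n"; // &lt;&#916;&gt;
```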
It’s common to trim whitespace at the start and the end of a string. Unicode lists about 20 whitespace characters, and there are some ASCII-based control characters that should be considered as well for such trimming.
$title = utf8_trim($title);
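PHP's built-in trim() only knows a handful of ASCII characters, so a Unicode-aware trim has to name the wider set explicitly. A minimal sketch, assuming that "whitespace" here means the Unicode separator category (\p{Z}) plus control characters (\p{Cc}); the function name is hypothetical.

```php
<?php
// Hypothetical helper: strip Unicode separators and control characters
// from both ends of a string.
function sketch_trim(string $str): string
{
    $trimmed = preg_replace('/\A[\s\p{Z}\p{Cc}]+|[\s\p{Z}\p{Cc}]+\z/u', '', $str);
    // preg_replace() returns null for invalid UTF-8; keep the input in that case.
    return $trimmed ?? $str;
}

// A no-break space (U+00A0) in front and an ideographic space (U+3000)
// plus a newline behind; trim() would leave the non-ASCII ones in place.
echo sketch_trim("\xC2\xA0 title\xE3\x80\x80\n"), "\n"; // title
```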
On the other hand, there may be duplicate whitespace characters in the middle of the string that should be removed. The following shows how utf8_remove_duplicates() and utf8_ws() can be used together:
$title = utf8_remove_duplicates($title, utf8_ws());
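The idea can be approximated with a back-reference that collapses consecutive repeats of the same whitespace character into one. This sketch assumes that "duplicates" means adjacent repeats of an identical character, which may not match the library's exact semantics; the function name is hypothetical.

```php
<?php
// Hypothetical helper: collapse runs of one repeated whitespace
// character (ASCII or Unicode separator) down to a single occurrence.
function sketch_collapse_whitespace(string $str): string
{
    $collapsed = preg_replace('/([\s\p{Z}])\1+/u', '$1', $str);
    // preg_replace() returns null for invalid UTF-8; keep the input in that case.
    return $collapsed ?? $str;
}

// Two spaces become one, two no-break spaces (U+00A0) become one.
echo bin2hex(sketch_collapse_whitespace("a  b\xC2\xA0\xC2\xA0c")), "\n"; // 612062c2a063
```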
Traditional solutions for creating URL slugs for SEO reasons use transliteration and strip all non-ASCII characters from the slug. That makes a URL less valuable than it could be. Because URLs can carry UTF-8 encoded characters, there is no need for such stripping or transliteration, and we can create rich slugs containing characters of any language:
$slug = utf8_url_slug($title, 30); // char length 30
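A slug that keeps non-ASCII letters can be sketched with Unicode character classes: letters, combining marks and digits of any script are kept, and everything else becomes a hyphen. The function name and the choice of hyphen as separator are assumptions for this example, not the library's documented behavior.

```php
<?php
// Hypothetical helper: build a slug that preserves letters and digits
// of any script and is at most $maxChars characters (not bytes) long.
function sketch_url_slug(string $title, int $maxChars): string
{
    // Replace each run of characters that are not letters, marks or digits.
    $slug = preg_replace('/[^\p{L}\p{M}\p{N}]+/u', '-', $title) ?? '';
    $slug = trim($slug, '-');

    // Cut by characters; the /u modifier makes "." match a whole character.
    preg_match('/\A.{0,' . max(0, $maxChars) . '}/us', $slug, $m);
    return trim($m[0] ?? '', '-');
}

// "Δx & Σxy!" becomes "Δx-Σxy"
echo sketch_url_slug("\xCE\x94x & \xCE\xA3xy!", 30), "\n";
```

When the slug is placed in a link, non-ASCII characters still need to be percent-encoded, for example with rawurlencode(); browsers display the decoded form.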
From input validation all the way to saving the data in a database, a Unicode-aware application focuses on characters and character length rather than bytes and byte length. This shift of focus calls for an interface that understands the difference. It’s common to enforce a limit on the character length of input, so here we create a substring if the input exceeds 60 characters.
if (utf8_strlen($title) > 60) {
    $title = utf8_substr($title, 0, 60);
}
Or alternatively:
if (!utf8_fits_inside($title, 60)) {
    $title = utf8_substr($title, 0, 60);
}
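To show what character-based truncation means underneath, here is a sketch that cuts a string to a maximum number of characters using only PCRE. The function name is hypothetical and valid UTF-8 input is assumed.

```php
<?php
// Hypothetical helper: keep at most $maxChars characters. With the /u
// modifier, "." matches one whole character, however many bytes it uses.
function sketch_truncate(string $str, int $maxChars): string
{
    if (preg_match('/\A.{0,' . max(0, $maxChars) . '}/us', $str, $m) !== 1) {
        return ''; // invalid UTF-8: nothing can be cut safely
    }
    return $m[0];
}

$title = "\xCE\x94x=\xCE\xA3y"; // "Δx=Σy": 5 characters, 7 bytes
echo sketch_truncate($title, 4), "\n";         // Δx=Σ
echo strlen(sketch_truncate($title, 4)), "\n"; // 6 bytes for 4 characters
```

A database column limit expressed in bytes would need a separate check, since four characters can occupy anywhere from four to sixteen bytes in UTF-8.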
There are three different ways to access individual characters with the Portable UTF-8 library. We can use utf8_access() to reach an individual character.
echo 'The sixth character is: ' . utf8_access($string, 5);
utf8_chr_map() allows individual character access iteratively using a callback function.
utf8_chr_map('some_callback', $string);
And we can split a string into an array of characters using utf8_split() and work with the array elements as individual characters.
array_map('some_callback', utf8_split($string));
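Splitting into characters can be reproduced without the library by using preg_split() with an empty pattern and the u modifier. The sketch below uses a hypothetical function name and assumes valid UTF-8 input; once the string is an array, positional access and callbacks work per character.

```php
<?php
// Hypothetical helper: split a string into an array of characters.
function sketch_split(string $str): array
{
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    // preg_split() returns false for invalid UTF-8.
    return $chars === false ? [] : $chars;
}

$chars = sketch_split("\xCE\x94x=\xCE\xA3"); // "Δx=Σ"

echo count($chars), "\n"; // 4 characters, although the string has 6 bytes
echo $chars[3], "\n";     // Σ, the fourth character

// Apply a callback to each character, here listing its byte length.
print_r(array_map('strlen', $chars)); // 2, 1, 1, 2
```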
Working with Unicode may also require finding the minimum or maximum code point in a string, splitting strings, working with the byte order mark, converting string case, randomizing or shuffling, replacing, and so on. All of that is supported by Portable UTF-8.
Conclusion
Development of PHP 6 has been stopped, which delays the much-needed native Unicode support for developing multilingual applications. In the meantime, server-side extensions and user-space libraries like Portable UTF-8 play an important role in helping developers build a better, standardized web that meets local needs.