Bringing Unicode to PHP with Portable UTF-8

Key Takeaways

Despite PHP’s ability to work with multibyte variable names and Unicode strings, the language lacks comprehensive Unicode support, primarily due to treating strings as a sequence of single-byte characters. This limitation affects various aspects of string manipulation, including substring extraction, determining string lengths, and string splitting.
Portable UTF-8 is a user-space library that brings Unicode support to PHP applications. Built over mbstring and iconv, it provides about 60 functions for Unicode-based string manipulation, testing, and validation and uses UTF-8 as its primary character encoding scheme. The library is fully portable and works with any installation of PHP version 4.2 or higher.
The Portable UTF-8 library provides several functions for handling Unicode strings, including validation of UTF-8 input, stripping invalid bytes, encoding text to HTML entities to prevent XSS attacks, trimming whitespace, removing duplicate whitespaces, creating URL slugs with UTF-8 characters, and enforcing limits on input character-length. This ensures a shift of focus from bytes and byte-length to characters and character-length in a Unicode-aware application.

PHP allows multibyte variable names like $a∩b, $Ʃxy and $Δx, mbstring and other extensions work with Unicode strings, and the utf8_encode() and utf8_decode() functions translate strings between the UTF-8 and ISO-8859-1 encodings. Yet it’s widely acknowledged that PHP lacks Unicode support.

This article covers what the lack of Unicode support means, and demonstrates the use of a library that brings Unicode support to your PHP application, Portable UTF-8.

Unicode Support in PHP

PHP’s lack of Unicode/multibyte support means that the standard string handling functions treat strings as a sequence of single-byte characters. In fact, the official manual defines a string in PHP as a “series of characters, where a character is the same as a byte.” PHP supports only 8-bit characters, while Unicode (and many other character sets) may require more than one byte to represent a character. This limitation of PHP affects almost all aspects of string manipulation, including (but not limited to) substring extraction, determining string lengths, string splitting, shuffling etc.
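To make the byte-versus-character distinction concrete, here is a minimal sketch that compares PHP's byte-oriented functions with their mbstring counterparts. It assumes PHP 7 or later (for the "\u{...}" escape) and that the mbstring extension is loaded; the variable names are only illustrative.

```php
<?php
// "héllo", written with an escape so the bytes are unambiguous.
$word = "h\u{00E9}llo";                      // é is U+00E9: two bytes in UTF-8

$byteLength = strlen($word);                 // counts bytes
$charLength = mb_strlen($word, 'UTF-8');     // counts characters

echo $byteLength, " bytes, ", $charLength, " characters\n"; // 6 bytes, 5 characters

// substr() cuts at byte offsets, so it can split é in half.
$brokenPrefix = substr($word, 0, 2);             // "h" plus the first byte of é
$cleanPrefix  = mb_substr($word, 0, 2, 'UTF-8'); // "hé"
```

The broken prefix is no longer valid UTF-8, which is the kind of silent corruption that byte-oriented functions cause.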

Efforts to solve the problem started in early 2005, but the work on bringing native Unicode support to PHP was stopped and shelved in 2010 for several reasons. Since native Unicode support in PHP may take years to come, if ever, developers must rely on extensions like mbstring and iconv, which fill the gap but provide only limited Unicode support. These extensions are not Unicode-centric and can also translate between non-Unicode encodings. Even so, they make working with Unicode strings easier.

But the aforementioned extensions also have their shortcomings. They provide only limited functionality for Unicode string handling, and none of them are enabled by default. The server administrator must explicitly enable any or all of the extensions to make them accessible to PHP applications. Shared hosting providers often make the situation worse by installing only one or two of the extensions, making it difficult for developers to rely on a consistently available API for their Unicode needs.

Despite all of this, the good thing is that PHP can output Unicode text. This is because PHP doesn’t really care whether we are sending out English text encoded in ASCII or some other text that belongs to a language whose characters are encoded in multiple bytes. Knowing this, what PHP developers now need is only an API that provides comfortable Unicode-based string manipulation.

Portable UTF-8

A recent solution is the creation of user-space libraries written in PHP. These libraries can be easily bundled with an application to ensure the presence of Unicode support, even if support at the server/language level is missing. Many open-source applications already include their own such libraries and many more use freely available third-party libraries; one such library is Portable UTF-8.

Portable UTF-8 is a free, lightweight library built over mbstring and iconv. It extends the capabilities of the two extensions to provide about 60 functions for Unicode-based string manipulation, testing, and validation; it offers UTF-8 aware counterparts for almost all of PHP’s common string-handling functions. As its name suggests, Portable UTF-8 uses UTF-8 as its primary character encoding scheme.

The library uses the available extensions (mbstring and iconv) for speed reasons, and smooths over some of the inconsistencies of working with them directly, but falls back to UTF-8 routines written in pure PHP if the extensions aren’t available on the server. Portable UTF-8 is fully portable and works with any installation of PHP version 4.2 or higher.
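The extension-first, pure-PHP-fallback idea can be sketched as follows. This is not Portable UTF-8's actual code: char_length() and char_length_pure() are hypothetical names, the sketch uses PHP 7 type declarations, and the fallback assumes the input is already valid UTF-8.

```php
<?php
// Pure-PHP fallback: every UTF-8 character contains exactly one byte that
// is NOT a continuation byte (10xxxxxx), so counting those bytes counts
// characters. The pattern runs in byte mode (no /u modifier).
function char_length_pure(string $text): int
{
    return (int) preg_match_all('/[^\x80-\xBF]/', $text);
}

// Prefer the fast extension function when it is available.
function char_length(string $text): int
{
    if (function_exists('mb_strlen')) {
        return mb_strlen($text, 'UTF-8');
    }
    return char_length_pure($text);
}

$sample = "h\u{00E9}llo \u{4F60}";   // "héllo 你": 7 characters, 10 bytes
echo char_length($sample), "\n";
```

Callers only ever see char_length(), so the application behaves the same whether or not the server has mbstring installed.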

String Handling with Portable UTF-8

A text editor with bad Unicode support may corrupt text when reading it, and text copied from such an editor and pasted into a web form might be a source of invalid UTF-8 in your application. When dealing with user-submitted input, it’s important that we make sure the input is exactly what the application expects. To detect whether the text is valid UTF-8, we can use the library’s is_utf8() function.

if (is_utf8($_POST['title'])) {
    // do something...
}
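For comparison, a similar check can be approximated with built-in functions alone. The sketch below is an assumption-laden stand-in, not the library's is_utf8() implementation, and looks_like_utf8() is a hypothetical name. It uses mb_check_encoding() when mbstring is present and otherwise relies on the fact that PCRE rejects invalid UTF-8 subjects when the /u modifier is set.

```php
<?php
// Hypothetical helper; Portable UTF-8's is_utf8() may apply different rules.
function looks_like_utf8(string $text): bool
{
    if (function_exists('mb_check_encoding')) {
        return mb_check_encoding($text, 'UTF-8');
    }
    // With /u, preg_match() returns false (not 0 or 1) for invalid UTF-8.
    return preg_match('//u', $text) === 1;
}

$valid   = "caf\xC3\xA9";   // "café" encoded as UTF-8
$invalid = "caf\xE9";       // "café" encoded as ISO-8859-1

var_dump(looks_like_utf8($valid), looks_like_utf8($invalid));
```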

Recovering characters from invalid bytes is an impossible exercise, so stripping out the bytes that cannot be recognized as valid UTF-8 characters might be your only option. We can strip invalid bytes with the utf8_clean() function.

$title = utf8_clean($_POST['title']);
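One way to get a comparable effect with mbstring alone is to convert the string from UTF-8 to UTF-8 while telling mbstring to drop anything it cannot decode. This is only a sketch: strip_invalid_utf8() is a hypothetical helper, and utf8_clean() may well make different decisions about which bytes to keep.

```php
<?php
// Hypothetical helper; not Portable UTF-8's utf8_clean().
function strip_invalid_utf8(string $text): string
{
    $previous = mb_substitute_character();   // remember the current setting
    mb_substitute_character('none');         // drop invalid bytes instead of writing "?"
    $clean = mb_convert_encoding($text, 'UTF-8', 'UTF-8');
    mb_substitute_character($previous);      // restore the earlier setting
    return $clean;
}

// 0xFF can never appear in well-formed UTF-8, so it is removed.
$title = strip_invalid_utf8("a\xFFb");
```

Restoring the previous substitute character matters because the setting is global to the request and other code may depend on it.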

Each Unicode character can be encoded to a corresponding HTML entity, and you may want to encode text this way to help prevent XSS attacks before outputting it to the browser.

echo utf8_html_encode($title);
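If the library is not at hand, PHP's own htmlspecialchars() covers the XSS side of this step, although it escapes only the HTML-significant characters rather than converting every character to an entity. The sketch below assumes PHP 7 or later; the sample title is made up for illustration.

```php
<?php
// A hostile-looking title that also contains a non-ASCII character.
$title = "<script>alert('x')</script> caf\u{00E9}";

// ENT_QUOTES escapes both quote styles; ENT_SUBSTITUTE replaces invalid
// byte sequences with U+FFFD instead of returning an empty string.
$safe = htmlspecialchars($title, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

echo $safe, "\n";
```

Passing the encoding explicitly keeps the result independent of the server's default_charset setting.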

It’s common to trim whitespace at the start and the end of a string. Unicode lists about 20 whitespace characters, and there are some ASCII-based control characters that should be considered as well for such trimming.

$title = utf8_trim($title);
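To see why the built-in trim() is not enough here, consider a sketch based on PCRE's Unicode properties. The helper name unicode_trim() is hypothetical and the character classes chosen (\p{Z} for separators, \p{Cc} for control characters) are an assumption about what counts as trimmable; utf8_trim() may use a different list.

```php
<?php
// Hypothetical helper; not Portable UTF-8's utf8_trim().
function unicode_trim(string $text): string
{
    // \p{Z} matches Unicode separators (spaces of all widths);
    // \p{Cc} matches control characters such as tab and newline.
    // preg_replace() returns null for invalid UTF-8, so fall back to the input.
    return preg_replace('/^[\p{Z}\p{Cc}]+|[\p{Z}\p{Cc}]+$/u', '', $text) ?? $text;
}

// No-break space, em space, and ideographic space around the word.
$padded = "\u{00A0}\u{2003} title\t\u{3000}\n";

$trimmed = unicode_trim($padded);   // "title"
$builtin = trim($padded);           // still starts with the no-break space
```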

On the other hand, there may be duplicates of such whitespace characters in the middle of the string that should be removed. The following shows how the utf8_remove_duplicates() and utf8_ws() functions can be used together:

$title = utf8_remove_duplicates($title, utf8_ws());
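The same idea can be approximated with a single regular expression and a backreference. This is a sketch under the assumption that "duplicates" means consecutive repeats of the same whitespace character; collapse_repeated_whitespace() is a hypothetical name and is not how the library implements it.

```php
<?php
// Hypothetical helper; not Portable UTF-8's utf8_remove_duplicates().
function collapse_repeated_whitespace(string $text): string
{
    // ([\p{Z}\s]) captures one whitespace character; \1+ matches further
    // copies of that same character. The run is replaced by a single copy.
    return preg_replace('/([\p{Z}\s])\1+/u', '$1', $text) ?? $text;
}

$ascii = collapse_repeated_whitespace("a   b");                         // "a b"
$wide  = collapse_repeated_whitespace("a\u{3000}\u{3000}\u{3000}b");    // one U+3000 remains
```

Because the backreference only matches identical characters, a space followed by a no-break space is left alone.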

Traditional solutions for creating URL slugs for SEO reasons use transliteration and strip all non-ASCII characters from the slug. That makes a URL less valuable than it could be. Since URLs can support UTF-8 encoded characters, there is no need for such stripping or transliteration, and we can create rich slugs containing characters of any language:

$slug = utf8_url_slug($title, 30); // char length 30
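A rough idea of what a Unicode-preserving slug function does can be sketched with mbstring and PCRE. The helper unicode_slug() and its rules (lowercase, hyphen-separated, letters and digits of any script kept) are assumptions for illustration; utf8_url_slug() may produce different output.

```php
<?php
// Hypothetical helper; not Portable UTF-8's utf8_url_slug().
function unicode_slug(string $title, int $maxLength = 30): string
{
    $slug = mb_strtolower($title, 'UTF-8');
    // Replace each run of characters that are not letters (\p{L}),
    // combining marks (\p{M}), or digits (\p{N}) with one hyphen.
    $slug = preg_replace('/[^\p{L}\p{M}\p{N}]+/u', '-', $slug) ?? '';
    $slug = trim($slug, '-');   // the hyphen is ASCII, so byte-wise trim is safe
    // Truncate by characters, then drop a hyphen left dangling at the cut.
    return rtrim(mb_substr($slug, 0, $maxLength, 'UTF-8'), '-');
}

// "Héllo Wörld! 你好" becomes "héllo-wörld-你好"
$slug = unicode_slug("H\u{00E9}llo W\u{00F6}rld! \u{4F60}\u{597D}");
```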

From the start with input validation until we save the data to some database, a Unicode-aware application focuses on characters and character-length rather than bytes and byte-length. This shift of focus necessitates a new interface that understands the difference. It’s common to enforce a limit on input character-length, so here we are creating a sub-string if the input exceeds the length of 60 characters.

if (utf8_strlen($title) > 60) {
    $title = utf8_substr($title, 0, 60);
}

Or alternatively:

if (!utf8_fits_inside($title, 60)) {
    $title = utf8_substr($title, 0, 60);
}
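The difference between a character limit and a byte limit is easy to check with mbstring directly. In this sketch, limit_characters() is a hypothetical helper and the 70-character test string is made up; it assumes PHP 7 or later with mbstring loaded.

```php
<?php
// Hypothetical helper that enforces a limit measured in characters.
function limit_characters(string $text, int $limit): string
{
    if (mb_strlen($text, 'UTF-8') <= $limit) {
        return $text;                        // already short enough
    }
    return mb_substr($text, 0, $limit, 'UTF-8');
}

$title = str_repeat("\u{00E9}", 70);         // 70 characters, 140 bytes
$short = limit_characters($title, 60);

echo mb_strlen($short, 'UTF-8'), " characters, ", strlen($short), " bytes\n";
```

A database column sized for 60 bytes would not hold this 60-character string, which is why the storage layer needs to think in characters as well.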

There are three different ways to access an individual character with the Portable UTF-8 library. We can use utf8_access() to reach an individual character.

echo 'The sixth character is: ' . utf8_access($string, 5);

utf8_chr_map() allows individual character access iteratively using a callback function.

utf8_chr_map('some_callback', $string);

And we can split a string into an array of characters using utf8_split() and work with the array elements as individual characters.

array_map('some_callback', utf8_split($string));

Working with Unicode may also require finding the minimum or maximum code point in a string, splitting strings, working with the byte order mark, converting string case, randomizing or shuffling, replacing, and so on. All of that is supported by Portable UTF-8.

Conclusion

Development of PHP 6 has been stopped, delaying the much-needed native Unicode support for developing multilingual applications. In the meantime, server-side extensions and user-space libraries like Portable UTF-8 play an important role in helping developers build a better, standardized web that meets local needs.

Frequently Asked Questions (FAQs) about Bringing Unicode to PHP with Portable UTF8

What is the significance of Unicode in PHP?

Unicode is a universal character encoding standard that provides a unique number for every character across various languages and platforms. In PHP, Unicode plays a crucial role in ensuring that the text data is consistently represented and understood, regardless of where it is used. It helps in handling text in an internationalized and language-neutral way, thereby enhancing the global usability of PHP applications.

How does Portable UTF8 help in PHP?

Portable UTF8 is a PHP library that provides Unicode support to PHP applications. It offers a collection of static methods for UTF-8 encoding, string handling, identification, and conversion tasks. It helps in overcoming the limitations of PHP’s native string functions, which are not fully Unicode-compliant. With Portable UTF8, developers can handle Unicode strings more effectively in their PHP code.

How to install and use Portable UTF8 in PHP?

Portable UTF8 can be easily installed using Composer, a dependency management tool for PHP. Once installed, you can use its methods by calling them with the namespace ‘voku\helper\UTF8’. For example, to convert a string to UTF8, you can use the ‘to_utf8’ method like this: ‘UTF8::to_utf8($string)’.

What is the difference between UTF8 and ASCII?

UTF8 and ASCII are both character encoding standards, but they differ in their range and usage. ASCII is a 7-bit encoding standard that can represent 128 characters, primarily used for English. On the other hand, UTF8 is a variable-length encoding standard that can represent over a million characters, making it suitable for almost all languages in the world.

How to convert Unicode codepoints to UTF8 in PHP?

PHP provides a function called ‘json_decode’ that can be used to convert Unicode codepoints to UTF8. Here’s an example: ‘$utf8_string = json_decode('"\u202E"');’. This will convert the Unicode codepoint U+202E to its corresponding UTF8 string.
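The following sketch shows the json_decode approach next to two alternatives that newer PHP versions offer: the "\u{...}" string escape (PHP 7.0 and later) and mb_chr() (PHP 7.2 and later, with mbstring). All three should yield the same three bytes for U+202E.

```php
<?php
// Single quotes keep \u202E literal so that the JSON parser decodes it.
$fromJson = json_decode('"\u202E"');

// PHP 7+ double-quoted strings understand code point escapes directly.
$fromEscape = "\u{202E}";

// mb_chr() builds a character from an integer code point.
$fromMbChr = mb_chr(0x202E, 'UTF-8');

echo bin2hex($fromJson), "\n";   // e280ae
```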

How to handle UTF8 strings in MySQL with PHP?

To handle UTF8 strings in MySQL with PHP, you need to set the character set of your MySQL connection to ‘utf8mb4’. This can be done using the ‘mysqli_set_charset’ function in PHP like this: ‘mysqli_set_charset($connection, 'utf8mb4');’. This ensures that your MySQL database can store and retrieve UTF8 strings correctly.

How to encode and decode UTF8 strings in PHP?

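PHP provides built-in functions for encoding and decoding UTF8 strings, but only to and from ISO-8859-1. The ‘utf8_encode’ function converts an ISO-8859-1 string to UTF8, while the ‘utf8_decode’ function converts a UTF8 string back to ISO-8859-1. Both functions are deprecated as of PHP 8.2; ‘mb_convert_encoding’ handles these and other encodings.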

How to handle special characters in UTF8 with PHP?

Special characters in UTF8 can be handled in PHP using the ‘htmlentities’ function. This function converts all applicable characters to their corresponding HTML entities, thereby preserving their original representation in the output.

How to validate a UTF8 string in PHP?

PHP provides a function called ‘mb_check_encoding’ that can be used to validate a UTF8 string. Here’s an example: ‘if (mb_check_encoding($string, 'UTF-8')) { /* valid UTF8 string */ } else { /* invalid UTF8 string */ }’. This function returns true if the string is valid UTF8, and false otherwise.

How to handle UTF8 in PHP’s JSON functions?

PHP’s JSON functions natively support UTF8. However, if you’re dealing with non-UTF8 strings, you need to convert them to UTF8 before using them with these functions. For ISO-8859-1 input this can be done using the ‘utf8_encode’ function; for other encodings, use ‘mb_convert_encoding’ or ‘iconv’. Also, the ‘json_last_error’ function can be used to check for any errors related to UTF8 encoding in the JSON functions.
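A short sketch of that failure and its fix, assuming the input is known to be ISO-8859-1 and that mbstring is loaded; the sample string is made up for illustration.

```php
<?php
$latin1 = "caf\xE9";                       // "café" in ISO-8859-1, invalid as UTF-8

$failed = json_encode($latin1);            // false: the input is not valid UTF-8
$error  = json_last_error();               // JSON_ERROR_UTF8 (read it before the next JSON call)

// Convert to UTF-8 first, naming the source encoding explicitly.
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

$escaped = json_encode($utf8);                           // "caf\u00e9"
$raw     = json_encode($utf8, JSON_UNESCAPED_UNICODE);   // "café" as raw UTF-8

echo $escaped, "\n";
```

By default json_encode() escapes non-ASCII characters as \uXXXX sequences; JSON_UNESCAPED_UNICODE keeps them as UTF-8 bytes, which is usually more compact.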