PHP Master | Working with Multibyte Strings

A written language, whether it’s English, Japanese, or whatever else, consists of a number of characters, so an essential problem when working with a language digitally is to find a way to represent each character in a digital manner. Back in the day we only needed to represent English characters, but it’s a whole different ball game today and the result is a bewildering number of character encoding schemes used to represent the characters of many different languages. How does PHP relate to and deal with these different schemes?

Key Takeaways

Multibyte characters, which use anywhere from one to four bytes to define characters, are essential for digitally representing languages with more than 256 unique characters. Unicode, especially UTF-8, is the most commonly used encoding scheme for these characters.
PHP is not inherently designed to handle multibyte characters. To work with these characters, a special set of functions known as the mbstring functions should be used. However, PHP’s HTTP headers also contain a character set identification that can override the meta tag of your page.
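Multibyte support is not a default feature of PHP and requires reconfiguration. To enable the mb functions, build PHP with the --enable-mbstring compile-time option. The run-time configuration option mbstring.encoding_translation is separate: it controls automatic conversion of incoming HTTP input and is not needed just to call the functions.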
There are several multibyte string commands available in PHP, such as mb_check_encoding, mb_strlen, and mb_ereg_search, which are used to check if a specific encoding sequence is valid, find the number of characters in a multibyte string, and perform a multibyte version of the traditional character search, respectively.

The Basics

We all know that a ‘bit’ is a thing that can be either a 0 or 1 and nothing else, and a ‘byte’ is a grouping of eight consecutive bits. Since there are eight of these dual value spots in a byte, one byte can be configured in a total of 256 distinct patterns (2 to the power of 8). It’s possible to associate a different character with each possible 8-bit pattern.

Put these bytes together in different orders and you have yourself some communication. It’s not necessarily intelligent, that depends on who is at each end, but it is communication. As long as we can express a language in 256 unique characters or fewer, we’re set.

But what if we can’t express a language with just 256 characters? Or what if we need to express multiple languages in the same document? Today, as we digitize everything we can find, 256 characters is nowhere near enough. Luckily character schemes that are more up to the challenge have been devised. These new, super character sets use anywhere from one to four bytes to define characters.

The big dog in the character encoding scene today is Unicode, a scheme that uses multiple bytes to represent characters. It was developed by the Unicode Consortium and there are several encodings of it: UTF-32 which is used on the Dreadnaught class of starships, UTF-16 which is used on the Star Trek: Into Darkness Enterprise, and UTF-8 which is what most of us in the real world should use for our web applications.

As I said, Unicode (including UTF-8) uses multiple byte configurations to represent characters. UTF-8 uses anywhere from one to four bytes to produce the 1,112,064 patterns to represent different characters. These ‘wide characters’ take up more space, but UTF-8 does have a tendency to be faster to process than some other encoding schemes.
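To make the "one to four bytes" claim concrete, here is a small sketch that prints the UTF-8 byte length of four sample characters. The sample characters and labels are my own picks, and the "\u{...}" escape syntax assumes PHP 7 or later.

```php
<?php
// Each sample is written with a \u{...} escape (PHP 7+), so the encoding of
// this source file does not affect the result.
$samples = [
    'Latin capital A (U+0041)'     => "\u{0041}",
    'Latin e with acute (U+00E9)'  => "\u{00E9}",
    'CJK ideograph (U+4F60)'       => "\u{4F60}",
    'Emoji (U+1F600)'              => "\u{1F600}",
];

// strlen() counts bytes, which is exactly what we want to see here.
foreach ($samples as $label => $char) {
    printf("%s: %d byte(s)\n", $label, strlen($char));
}
```

The plain ASCII letter takes a single byte, which is why pure ASCII text is already valid UTF-8.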

Why is everyone ooh-ing and aah-ing about UTF-8? Partly it’s the hot models that have been spotlighted in the Support UTF-8 commercials seen on ESPN and TCM, but mostly it’s because UTF-8 mimics ASCII and if you don’t have any special characters involved, it tracks ASCII exactly.

And This Affects PHP How?

I know what you’re thinking. I just have to set the character set in my meta tags to ‘UTF-8’ and everything will be okay. But that’s not true.

First, the simple truth is that PHP is not really designed to deal with multibyte characters and so doing things to these characters using the standard string functions may produce uncertain results. When we need to work with these multibyte characters, we need to use a special set of functions: the mbstring functions.
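As a rough illustration of those "uncertain results", the sketch below takes the first "character" of a two-character string with substr() and with mb_substr(). The sample text is an assumption chosen for the demonstration.

```php
<?php
$text = "\u{4F60}\u{597D}"; // two characters, six bytes in UTF-8

// substr() counts bytes, so this keeps only the first byte of a three-byte character.
$byteSlice = substr($text, 0, 1);

// mb_substr() counts characters once it is told the encoding.
$charSlice = mb_substr($text, 0, 1, 'UTF-8');

var_dump(mb_check_encoding($byteSlice, 'UTF-8')); // bool(false): a broken fragment
var_dump($charSlice === "\u{4F60}");              // bool(true): one whole character
```

The byte-oriented call does not fail loudly; it quietly hands back a fragment that is no longer valid UTF-8.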

And second, even if you have PHP under control, there can still be problems. The HTTP headers covering your communication also contain a character set identification and that will override what’s in the meta tag of your page.
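One way to keep the header and the meta tag in agreement is to send the charset yourself. This is only a sketch: htmlContentType() is a hypothetical helper name, and your server or framework may already set this header for you.

```php
<?php
// Hypothetical helper: builds the Content-Type header line for an HTML page.
function htmlContentType(string $charset = 'UTF-8'): string
{
    return 'Content-Type: text/html; charset=' . $charset;
}

// Headers can only be changed before any output has been sent.
if (!headers_sent()) {
    header(htmlContentType());
}
```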

So, how does PHP deal with multibyte characters? There are two function groups that affect multibyte strings.

The first is iconv. With 5.0, this has become a default part of the language, a way to convert one character set into another character set representation. This is not what we are going to talk about in this article.

The second is multibyte support, a series of commands prefixed with “mb_”. There are a number of these commands, and a quick review shows that some of them check whether characters are valid for a given encoding scheme, while others are search-oriented functions, similar to the ones that are part of the PHP regular expressions but built to work on multibyte strings.

Turning on Multibyte Support for PHP

Multibyte support is not a default feature of PHP, but neither does it require that we download any extra libraries or extensions; it just requires some reconfiguration. Unfortunately, if you’re using a hosted version of PHP, this might not be something you can do.

Take a look at your configuration using the phpinfo() function. Scroll about half-way down the output and there will be a section labeled “mbstring”. This will show you whether the basic functionality is enabled. For information on how to enable this, you can refer to the manual. In short, you enable the mb functions by using the --enable-mbstring compile-time option. The run-time configuration option mbstring.encoding_translation is optional and only controls automatic conversion of incoming HTTP input.
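If scrolling through phpinfo() output feels like too much work, a script can ask the same question directly. A minimal sketch, assuming all you need to know is whether the extension is loaded:

```php
<?php
// extension_loaded() reports whether mbstring was compiled in or loaded.
$hasMbstring = extension_loaded('mbstring');

if ($hasMbstring) {
    echo 'mbstring is available; internal encoding: ', mb_internal_encoding(), PHP_EOL;
} else {
    echo 'mbstring is not loaded, so the mb_ functions are undefined.', PHP_EOL;
}
```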

The ultimate solution was supposed to be PHP 6, because it was going to use the IBM (please, everyone remove their ball caps) ICU libraries to ensure native support for multibyte character sets. That release was abandoned and never shipped, so there is nothing to sit back and wait for, eh buddy roe? ICU is reachable today through the separate intl extension, but for everyday string work, check out the multibyte support that is available now.

Multibyte String Commands

It’s possible that there are 53 different multibyte string commands. It’s also possible that there are 54. I sort of lost count at one point, but you get the idea. Needless to say we’re not going to go through each one, but just for kicks let’s take a quick look at a few.

mb_check_encoding

The mb_check_encoding() function checks to determine if a specific encoding sequence is valid for an encoding scheme. The function does not tell you what the string is encoded as (or what schemes it will work for), but it does tell you if it will work or not for the specified scheme.

<?php
// json_decode() turns the \uXXXX escapes into real UTF-8 characters
$string = '\u4F60\u597D\u4E16\u754C';
$string = json_decode('"' . $string . '"');
$valid = mb_check_encoding($string, 'UTF-8');
echo ($valid) ? 'valid' : 'invalid';

You can find a list of the supported encodings in the PHP manual.
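To see the function say "no", it helps to feed it something broken. The byte sequence below is an assumption chosen because it is a well-known invalid UTF-8 pair; the same valid string is also checked against ASCII to show that the answer depends on the scheme you name.

```php
<?php
$good = "\u{4F60}\u{597D}"; // well-formed UTF-8
$bad  = "\xC3\x28";         // 0xC3 opens a two-byte sequence, but 0x28 is not a continuation byte

var_dump(mb_check_encoding($good, 'UTF-8')); // bool(true)
var_dump(mb_check_encoding($bad, 'UTF-8'));  // bool(false)
var_dump(mb_check_encoding($good, 'ASCII')); // bool(false): valid UTF-8, but not ASCII
```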

mb_strlen

The strlen() function returns the number of bytes in a string. For ASCII where each character is a single byte, this works fine to find the number of characters. With multibyte strings you need to use the mb_strlen() function.

<?php
$string = '\u4F60\u597D\u4E16\u754C';
$string = json_decode('"' . $string . '"');
echo strlen($string), PHP_EOL; // outputs 12 (bytes), not the character count
echo mb_strlen($string, 'UTF-8'), PHP_EOL; // outputs 4

mb_ereg_search

The mb_ereg_search() function performs a multibyte version of the traditional character search. But there are a few caveats – you need to specify the encoding scheme using the mb_regex_encoding() function, the regular expression doesn’t have delimiters (it’s just the pattern part), and both the regex and string are specified using mb_ereg_search_init().

<?php
// specify the encoding scheme
mb_regex_encoding('UTF-8');
// specify haystack and search
$string = '\u4F60\u597D\u4E16\u754C';
$string = json_decode('"' . $string . '"');
$pattern = '\u754C';
$pattern = json_decode('"' . $pattern . '"');
mb_ereg_search_init($string, $pattern);
// finally we can perform the search
$result = mb_ereg_search();
echo ($result) ? "found" : "not found";

Had Enough?

I don’t know about you but I think the world really needs more simple things. Unfortunately, multibyte processing is not going to fill that need. But for now it’s something you can’t ignore. There are times when you won’t be able to perform normal PHP string processing, because you are trying to do it over characters outside the ASCII range (U+0000 to U+007F). And that means you have to use the mb_ oriented functions.

Want to know more? Seriously, you do? I honestly thought that would scare you away. I was not prepared for that. And my time is up. Your best bet? Check out the PHP manual. Oh, and try stuff. There’s no substitute for actual experience using something.

Frequently Asked Questions (FAQs) about Working with Multibyte Strings

What is the Importance of Multibyte Strings in PHP?

Multibyte strings are crucial in PHP because they allow for the manipulation and handling of strings that contain characters from almost any language in the world. This is particularly important in today’s globalized digital environment where applications often need to support multiple languages. PHP’s mbstring extension provides functions that help in dealing with multibyte strings, ensuring that characters are correctly represented and processed regardless of their byte length.

How Do I Install the mbstring Extension in PHP?

The mbstring extension is not enabled by default in PHP. To install it from source, you need to compile PHP with the --enable-mbstring option. Alternatively, if you’re using a package manager like apt for Ubuntu or brew for macOS, you can install it using the package manager. For example, on Ubuntu, you can use the command sudo apt-get install php-mbstring.

How Can I Convert a String to a Multibyte String in PHP?

PHP’s mb_convert_encoding function converts a string from one encoding to another. You need to specify the input string, the desired output encoding, and optionally the input encoding; if you leave it out, PHP assumes the internal encoding (see mb_internal_encoding()). For example, to convert a string to UTF-16, you would use: mb_convert_encoding($string, 'UTF-16').
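Here is a minimal sketch of a round trip. It names the source encoding explicitly and uses UTF-16BE rather than plain UTF-16, an assumption made so the byte order of the result is unambiguous.

```php
<?php
$utf8  = "\u{4F60}\u{597D}";                              // 6 bytes in UTF-8
$utf16 = mb_convert_encoding($utf8, 'UTF-16BE', 'UTF-8'); // 4 bytes in UTF-16BE

echo bin2hex($utf16), PHP_EOL; // 4f60597d

// Converting back should reproduce the original bytes.
$roundTrip = mb_convert_encoding($utf16, 'UTF-8', 'UTF-16BE');
var_dump($roundTrip === $utf8); // bool(true)
```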

How Do I Handle Multibyte Strings in C++?

In C++, the standard library’s std::string class can hold multibyte strings, but it treats them as plain sequences of bytes: it has no knowledge of Unicode or any other encoding, so its length and indexing operations work in bytes rather than characters. You can use the mbstowcs function to convert a multibyte string to a wide character string, and the wcstombs function to convert in the other direction; both interpret the bytes according to the current C locale.

What is the Difference Between Single-byte and Multibyte Strings?

Single-byte strings are strings where each character is represented by a single byte. This is sufficient for languages like English that use the ASCII character set, but not for languages like Chinese or Japanese that have many more characters. Multibyte strings, on the other hand, allow for characters that are represented by more than one byte, making it possible to represent virtually any character from any language.

How Can I Trim a Multibyte String in PHP?

PHP’s mb_strimwidth function can be used to trim a multibyte string to a certain width. You need to specify the input string, the start position, the desired width, and optionally a string to append to the end if the string is trimmed. Width is not the same as character count: East Asian full-width characters count as two columns, and the appended string is included in the total. For example, to trim a string to a width of 10, you would use: mb_strimwidth($string, 0, 10, '…').
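A short sketch of both points, using sample strings of my own choosing. The first call shows the marker counting toward the width, and mb_strwidth() shows four CJK characters occupying eight columns.

```php
<?php
// Ten columns in total: seven kept characters plus the three-character marker.
$short = mb_strimwidth('Hello World', 0, 10, '...', 'UTF-8');
echo $short, PHP_EOL; // Hello W...

// Four full-width characters occupy eight columns, not four.
$cjk = "\u{4F60}\u{597D}\u{4E16}\u{754C}";
echo mb_strwidth($cjk, 'UTF-8'), PHP_EOL; // 8
```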

How Do I Detect the Encoding of a Multibyte String in PHP?

PHP’s mb_detect_encoding function can be used to detect the encoding of a multibyte string. You need to specify the input string and optionally a list of encodings to check against. If no list is given, the function falls back to the detection order, which is set by the mbstring.detect_order directive in the php.ini file or at run time with mb_detect_order(). Detection is a best guess, so passing true as the third argument for strict checking is usually a good idea.
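A minimal sketch of strict detection against an explicit candidate list. The candidate list and the sample inputs are assumptions for illustration; real detection across similar encodings is much less clear-cut than this.

```php
<?php
$text = "\u{4F60}\u{597D}"; // UTF-8 bytes that cannot be ASCII

// Strict mode rejects any candidate for which the bytes are not valid.
$detected = mb_detect_encoding($text, ['ASCII', 'UTF-8'], true);
var_dump($detected); // string(5) "UTF-8"

// Bytes that are invalid in every candidate produce false rather than a guess.
$none = mb_detect_encoding("\xC3\x28", ['UTF-8'], true);
var_dump($none); // bool(false)
```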

How Can I Convert a Multibyte String to a Single-byte String in PHP?

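PHP’s mb_convert_encoding function can be used to convert a multibyte string to a single-byte string. You need to specify the input string and the desired output encoding. For example, to convert a string to ASCII, you would use: mb_convert_encoding($string, 'ASCII'). Keep in mind that this conversion is lossy: any character with no equivalent in the target encoding is replaced by a substitute character.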
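The sketch below shows what happens to a character that has no place in the target encoding. It assumes the default mbstring substitute character, a question mark, has not been changed with mb_substitute_character().

```php
<?php
$utf8 = "caf\u{00E9}"; // "café": the last character is outside ASCII

// With default settings, the unrepresentable character becomes "?".
$ascii = mb_convert_encoding($utf8, 'ASCII', 'UTF-8');
echo $ascii, PHP_EOL; // caf?
```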

How Do I Count the Number of Characters in a Multibyte String in PHP?

PHP’s mb_strlen function can be used to count the number of characters in a multibyte string. You need to specify the input string and optionally the encoding; if you leave it out, the internal encoding is used. For example, to count the number of characters in a UTF-16 string, you would use: mb_strlen($string, 'UTF-16').

How Can I Split a Multibyte String into an Array of Characters in PHP?

PHP’s mb_str_split function (available since PHP 7.4) can be used to split a multibyte string into an array of characters. You need to specify the input string and optionally the length of each chunk if you want to split the string into chunks of a certain length. For example, to split a string into individual characters, you would use: mb_str_split($string).
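A quick sketch of both forms, splitting into single characters and into two-character chunks. The sample string is my own, and the encoding is passed explicitly rather than relying on the internal encoding.

```php
<?php
$text = "\u{4F60}\u{597D}\u{4E16}\u{754C}"; // four characters, twelve bytes

// One array element per character.
$chars = mb_str_split($text, 1, 'UTF-8');
echo count($chars), PHP_EOL; // 4

// Chunks of two characters each.
$pairs = mb_str_split($text, 2, 'UTF-8');
echo count($pairs), PHP_EOL; // 2
```

For comparison, the byte-oriented str_split($text) would return twelve one-byte fragments, none of which is a valid character on its own.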