Working with Multibyte Strings
A written language, whether it’s English, Japanese, or whatever else, consists of a number of characters, so an essential problem when working with a language digitally is to find a way to represent each character in a digital manner. Back in the day we only needed to represent English characters, but it’s a whole different ball game today and the result is a bewildering number of character encoding schemes used to represent the characters of many different languages. How does PHP relate to and deal with these different schemes?
We all know that a ‘bit’ is a thing that can be either a 0 or 1 and nothing else, and a ‘byte’ is a grouping of eight consecutive bits. Since there are eight of these dual value spots in a byte, one byte can be configured in a total of 256 distinct patterns (2 to the power of 8). It’s possible to associate a different character with each possible 8-bit pattern.
Put these bytes together in different orders and you have yourself some communication. It’s not necessarily intelligent, that depends on who is at each end, but it is communication. As long as we can express a language’s characters in 256 unique patterns or fewer, we’re set.
But what if we can’t express a language with just 256 characters? Or what if we need to express multiple languages in the same document? Today, as we digitize everything we can find, 256 characters is nowhere near enough. Luckily character schemes that are more up to the challenge have been devised. These new, super character sets use anywhere from one to four bytes to define characters.
The big dog in the character encoding scene today is Unicode, a standard that assigns a number (a code point) to every character and then defines several ways of encoding those numbers as bytes. It was developed by the Unicode Consortium and there are several encodings of it: UTF-32 which is used on the Dreadnaught class of starships, UTF-16 which is used on the Star Trek: Into Darkness Enterprise, and UTF-8 which is what most of us in the real world should use for our web applications.
As I said, Unicode (including UTF-8) uses multiple byte configurations to represent characters. UTF-8 uses anywhere from one to four bytes per character to cover the 1,112,064 valid Unicode code points. These multibyte characters take up more space than a single-byte character, but UTF-8 does have a tendency to be faster to process than some other encoding schemes.
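To make the one-to-four-byte idea concrete, here is a small sketch that measures a few single characters with strlen(), which counts bytes. It assumes PHP 7 or later (for the "\u{...}" escape), and the sample labels are just names I picked for illustration.

```php
<?php
// Assumption: PHP 7+ so that "\u{...}" produces the UTF-8 bytes of a code point.
$samples = [
    'letter A'  => 'A',          // U+0041
    'e-acute'   => "\u{00E9}",   // U+00E9
    'CJK ni'    => "\u{4F60}",   // U+4F60
    'emoji'     => "\u{1F600}",  // U+1F600
];

$widths = [];
foreach ($samples as $label => $char) {
    // Each sample is exactly one character, so the byte count is its UTF-8 width.
    $widths[$label] = strlen($char);
    echo $label, ': ', $widths[$label], " byte(s)\n";
}
```

The four samples come out at one, two, three, and four bytes respectively, which is the whole range UTF-8 uses.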
Why is everyone ooh-ing and aah-ing about UTF-8? Partly it’s the hot models that have been spotlighted in the Support UTF-8 commercials seen on ESPN and TCM, but mostly it’s because UTF-8 mimics ASCII and if you don’t have any special characters involved, it tracks ASCII exactly.
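Here is a rough way to see that ASCII compatibility for yourself. This sketch assumes the mbstring extension is enabled; the variable names are only illustrative.

```php
<?php
// Assumption: mbstring is enabled.
$text = 'Hello, PHP';

// A pure ASCII string is valid under both labels...
$isAscii = mb_check_encoding($text, 'ASCII');
$isUtf8  = mb_check_encoding($text, 'UTF-8');

// ...and its byte count equals its character count.
$sameLength = strlen($text) === mb_strlen($text, 'UTF-8');

// The bytes themselves are the familiar ASCII codes: P = 0x50, H = 0x48.
$hex = bin2hex('PHP');

var_dump($isAscii, $isUtf8, $sameLength, $hex);
```

As long as the text stays inside the ASCII range, nothing about its bytes changes when you call it UTF-8.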
And This Affects PHP How?
I know what you’re thinking. I just have to set the character set in my meta tags to ‘UTF-8’ and everything will be okay. But that’s not true.
First, the simple truth is that PHP is not really designed to deal with multibyte characters and so doing things to these characters using the standard string functions may produce uncertain results. When we need to work with these multibyte characters, we need to use a special set of functions: the mbstring functions.
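As a minimal sketch of what can go wrong, compare substr() with mb_substr() on a four-character Chinese string. It assumes PHP 7 or later and an enabled mbstring extension; the variable names are mine.

```php
<?php
// Assumptions: PHP 7+ for "\u{...}", mbstring enabled.
$string = "\u{4F60}\u{597D}\u{4E16}\u{754C}"; // four characters, three bytes each

// substr() works on bytes: this takes two BYTES and cuts the first character apart.
$byteSlice = substr($string, 0, 2);

// mb_substr() works on characters: this takes the first two CHARACTERS.
$charSlice = mb_substr($string, 0, 2, 'UTF-8');

$byteSliceValid = mb_check_encoding($byteSlice, 'UTF-8');
$charSliceValid = mb_check_encoding($charSlice, 'UTF-8');

var_dump($byteSliceValid); // the byte slice is no longer valid UTF-8
var_dump($charSliceValid); // the character slice is still valid
```

The byte-based slice leaves you with a fragment of a character, which is the sort of thing that shows up later as question marks or broken output.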
And second, even if you have PHP under control, there can still be problems. The HTTP headers covering your communication also contain a character set identification and that will override what’s in the meta tag of your page.
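One way to keep the header and the meta tag in agreement is to send the character set yourself before any output. This is only a sketch: contentTypeHeader() is a hypothetical helper I made up for the example, and it assumes you are serving HTML.

```php
<?php
// Hypothetical helper (not part of PHP): builds the header line for an HTML page.
function contentTypeHeader(string $charset = 'UTF-8'): string
{
    return 'Content-Type: text/html; charset=' . $charset;
}

$headerLine = contentTypeHeader();

// header() only works before any output has been sent to the client.
if (!headers_sent()) {
    header($headerLine);
}
```

With this in place, the charset in the HTTP response and the one in your meta tag say the same thing.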
So, how does PHP deal with multibyte characters? There are two function groups that affect multibyte strings.
The first is iconv. With PHP 5.0, this has become a default part of the language, a way to convert one character set into another character set representation. This is not what we are going to talk about in this article.
The second is multibyte support, a series of functions prefixed with “mb_”. There are a number of these functions, and a quick review shows that some of them check whether a string is valid for a given encoding scheme, while others are search-oriented functions, similar to the ones that are part of PHP’s regular expressions but designed around multibyte characters.
Turning on Multibyte Support for PHP
Multibyte support is not a default feature of PHP, but neither does it require that we download any extra libraries or extensions; it just requires some reconfiguration. Unfortunately, if you’re using a hosted version of PHP, this might not be something you can do.
Take a look at your configuration using the phpinfo() function. Scroll about half-way down the output and there will be a section labeled “mbstring”. This will show you whether the basic functionality is enabled. For information on how to enable this, you can refer to the manual. In short, you enable the mb functions by using the --enable-mbstring compile time option, and then set the mbstring run-time configuration options that the manual describes.
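If you would rather not scroll through phpinfo(), a quick check from code can tell you the same thing. This is a small sketch using extension_loaded(); the variable name is illustrative.

```php
<?php
// True when the mbstring extension is compiled in or loaded.
$mbstringLoaded = extension_loaded('mbstring');

if ($mbstringLoaded) {
    echo "mbstring is available\n";
    // Called with no argument, mb_internal_encoding() reports the current default.
    echo 'Internal encoding: ', mb_internal_encoding(), "\n";
} else {
    echo "mbstring is not available; the mb_ functions will be undefined\n";
}
```

On shared hosting this is often the fastest way to find out whether the rest of this article applies to you.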
The ultimate solution, of course, was supposed to be PHP 6, which planned to use the IBM (please, everyone remove their ball caps) ICU libraries to ensure native support for multibyte character sets. That release never shipped, though, so sitting back and waiting won’t help, eh buddy roe? Check out the multibyte support that is available now.
Multibyte String Commands
It’s possible that there are 53 different multibyte string commands. It’s also possible that there are 54. I sort of lost count at one point, but you get the idea. Needless to say we’re not going to go through each one, but just for kicks let’s take a quick look at a few.
The mb_check_encoding() function checks whether a specific byte sequence is valid for an encoding scheme. The function does not tell you what the string is encoded as (or what schemes it will work for), but it does tell you if it will work or not for the specified scheme.
<?php
$string = '\u4F60\u597D\u4E16\u754C';
// json_decode() turns the \uXXXX escapes into real UTF-8 characters
$string = json_decode('"' . $string . '"');
$valid = mb_check_encoding($string, 'UTF-8');
echo ($valid) ? 'valid' : 'invalid';
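To see the function say “no” as well, here is a sketch that feeds it raw byte sequences. It assumes mbstring is enabled; the byte strings are hand-picked examples of mine.

```php
<?php
// Assumption: mbstring is enabled.
$complete  = "\xE4\xBD\xA0"; // the three UTF-8 bytes of U+4F60
$truncated = "\xE4\xBD";     // the same sequence with its last byte missing
$loneByte  = "\xE9";         // a single high byte, not a complete UTF-8 sequence

$results = [
    'complete'  => mb_check_encoding($complete, 'UTF-8'),
    'truncated' => mb_check_encoding($truncated, 'UTF-8'),
    'loneByte'  => mb_check_encoding($loneByte, 'UTF-8'),
];

foreach ($results as $label => $ok) {
    echo $label, ': ', $ok ? 'valid' : 'invalid', "\n";
}
```

Only the complete sequence passes; the other two are exactly the kind of damage a byte-oriented function can leave behind.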
You can find a list of the supported encodings in the PHP manual.
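You can also ask your own installation which encodings it knows about. A minimal sketch, assuming mbstring is enabled:

```php
<?php
// mb_list_encodings() returns the names of every encoding this build supports.
$encodings = mb_list_encodings();

echo count($encodings), " encodings supported\n";

// Check for a specific one before relying on it.
$hasUtf8 = in_array('UTF-8', $encodings, true);
echo 'UTF-8 supported: ', $hasUtf8 ? 'yes' : 'no', "\n";
```

The exact count depends on your PHP version, so treat the list from your own server as the one that matters.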
The strlen() function returns the number of bytes in a string. For ASCII, where each character is a single byte, this works fine to find the number of characters. With multibyte strings you need to use the mb_strlen() function instead.
<?php
$string = '\u4F60\u597D\u4E16\u754C';
$string = json_decode('"' . $string . '"');
echo strlen($string); // outputs 12 (bytes, not characters): wrong!
echo mb_strlen($string, 'UTF-8'); // outputs 4
The mb_ereg_search() function performs a multibyte version of the traditional character search. But there are a few caveats: you need to specify the encoding scheme using the mb_regex_encoding() function, the regular expression doesn’t have delimiters (it’s just the pattern part), and both the regex and string are specified using mb_ereg_search_init().
<?php
// specify the encoding scheme
mb_regex_encoding('UTF-8');

// specify haystack and search
$string = '\u4F60\u597D\u4E16\u754C';
$string = json_decode('"' . $string . '"');
$pattern = '\u754C';
$pattern = json_decode('"' . $pattern . '"');
mb_ereg_search_init($string, $pattern);

// finally we can perform the search
$result = mb_ereg_search();
echo ($result) ? "found" : "not found";
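If all you need is the position of a plain substring rather than a regular expression match, mb_strpos() is a simpler option. The sketch below compares it with strpos() on the same string; it assumes PHP 7 or later with mbstring enabled, and the variable names are mine.

```php
<?php
// Assumptions: PHP 7+ for "\u{...}", mbstring enabled.
$string = "\u{4F60}\u{597D}\u{4E16}\u{754C}"; // four characters, three bytes each
$needle = "\u{754C}";                          // the fourth character

// strpos() reports a BYTE offset.
$bytePos = strpos($string, $needle);

// mb_strpos() reports a CHARACTER offset.
$charPos = mb_strpos($string, $needle, 0, 'UTF-8');

echo "byte offset: $bytePos\n";      // 9, because three 3-byte characters come first
echo "character offset: $charPos\n"; // 3, counting characters from zero
```

Both answers are correct for what they measure; the trouble starts when a byte offset gets passed to something that expects characters, or the other way around.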
I don’t know about you but I think the world really needs more simple things. Unfortunately, multibyte processing is not going to fill that need. But for now it’s something you can’t ignore. There are times when you won’t be able to perform normal PHP string processing because you are trying to do it over characters that fall outside the normal ASCII range (U+0000 to U+007F). And that means you have to use the mb_ oriented functions.
Want to know more? Seriously, you do? I honestly thought that would scare you away. I was not prepared for that. And my time is up. Your best bet? Check out the PHP manual. Oh, and try stuff. There’s no substitute for actual experience using something.