Working with Multibyte Strings

A written language, whether it’s English, Japanese, or whatever else, consists of a number of characters, so an essential problem when working with a language digitally is to find a way to represent each character in a digital manner. Back in the day we only needed to represent English characters, but it’s a whole different ball game today and the result is a bewildering number of character encoding schemes used to represent the characters of many different languages. How does PHP relate to and deal with these different schemes?

The Basics

We all know that a ‘bit’ is a thing that can be either a 0 or 1 and nothing else, and a ‘byte’ is a grouping of eight consecutive bits. Since there are eight of these dual value spots in a byte, one byte can be configured in a total of 256 distinct patterns (2 to the power of 8). It’s possible to associate a different character with each possible 8-bit pattern.
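
To make that arithmetic concrete, here is a minimal sketch using PHP’s built-in ord(), chr(), and sprintf() functions. The variable names are only for illustration, and the ** operator assumes PHP 5.6 or later.

```php
<?php
// Eight two-state bits give 2^8 distinct patterns per byte.
$patterns = 2 ** 8;
echo $patterns, "\n"; // 256

// ord() returns the numeric value of a single byte; chr() goes the other way.
$byte = ord('A');
echo $byte, "\n"; // 65

// The same byte shown as its eight-bit pattern.
$bits = sprintf('%08b', $byte);
echo $bits, "\n"; // 01000001

echo chr($byte), "\n"; // A
```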

Put these bytes together in different orders and you have yourself some communication. It’s not necessarily intelligent, that depends on who is at each end, but it is communication. As long as we can express a language’s characters in 256 unique patterns or fewer, we’re set.

But what if we can’t express a language with just 256 characters? Or what if we need to express multiple languages in the same document? Today, as we digitize everything we can find, 256 characters is nowhere near enough. Luckily character schemes that are more up to the challenge have been devised. These new, super character sets use anywhere from one to four bytes to define characters.

The big dog in the character encoding scene today is Unicode, a standard that assigns every character a number (a code point) and defines several encodings that turn those numbers into bytes. It is maintained by the Unicode Consortium, and the encodings you’ll meet most often are UTF-32, which is used on the Dreadnaught class of starships, UTF-16, which is used on the Star Trek: Into Darkness Enterprise, and UTF-8, which is what most of us in the real world should use for our web applications.

As I said, the Unicode encodings (including UTF-8) use multiple bytes to represent characters. UTF-8 uses anywhere from one to four bytes per character, which is enough to cover all 1,112,064 valid Unicode code points. These ‘wide characters’ take up more space, but UTF-8 does have a tendency to be faster to process than some other encoding schemes.
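
As a rough illustration of the one-to-four-byte range, the sketch below measures a few sample characters with strlen(), which counts bytes. The sample characters are arbitrary picks, and the "\u{...}" escape syntax assumes PHP 7.0 or later.

```php
<?php
// Arbitrary sample characters, one for each UTF-8 width.
$samples = [
    'latin A' => 'A',         // U+0041
    'e-acute' => "\u{00E9}",  // U+00E9
    'CJK ni'  => "\u{4F60}",  // U+4F60
    'emoji'   => "\u{1F600}", // U+1F600
];

$byteCounts = [];
foreach ($samples as $label => $char) {
    // strlen() counts bytes, so it reveals how wide each encoded character is.
    $byteCounts[$label] = strlen($char);
    printf("%-8s %d byte(s), hex %s\n", $label, strlen($char), bin2hex($char));
}
// The counts come out as 1, 2, 3 and 4 bytes respectively.
```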

Why is everyone ooh-ing and aah-ing about UTF-8? Partly it’s the hot models that have been spotlighted in the Support UTF-8 commercials seen on ESPN and TCM, but mostly it’s because UTF-8 mimics ASCII and if you don’t have any special characters involved, it tracks ASCII exactly.
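
Here is a small sketch of what “tracks ASCII exactly” means in practice. It assumes the mbstring extension is loaded, and the sample text is arbitrary.

```php
<?php
$text = 'Hello, PHP';

// Converting pure ASCII text to UTF-8 should leave every byte unchanged.
$asUtf8 = mb_convert_encoding($text, 'UTF-8', 'ASCII');
var_dump($asUtf8 === $text); // bool(true)

// The same bytes are valid under both encodings.
var_dump(mb_check_encoding($text, 'ASCII')); // bool(true)
var_dump(mb_check_encoding($text, 'UTF-8')); // bool(true)

// With no multibyte characters, byte count and character count agree.
var_dump(strlen($text) === mb_strlen($text, 'UTF-8')); // bool(true)
```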

And This Affects PHP How?

I know what you’re thinking. I just have to set the character set in my meta tags to ‘UTF-8’ and everything will be okay. But that’s not true.

First, the simple truth is that PHP is not really designed to deal with multibyte characters and so doing things to these characters using the standard string functions may produce uncertain results. When we need to work with these multibyte characters, we need to use a special set of functions: the mbstring functions.
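
To show the kind of “uncertain results” this means, the sketch below runs two byte-oriented functions over a two-character UTF-8 string and compares them with mb_substr(). It assumes the mbstring extension and PHP 7.0 or later (for the "\u{...}" escapes).

```php
<?php
$word = "\u{4F60}\u{597D}"; // two characters, six bytes in UTF-8

// substr() cuts at byte offsets, so this slices the first character apart.
$broken = substr($word, 0, 2);
var_dump(mb_check_encoding($broken, 'UTF-8')); // bool(false)

// mb_substr() cuts at character offsets instead.
$first = mb_substr($word, 0, 1, 'UTF-8');
var_dump($first === "\u{4F60}"); // bool(true)

// strrev() reverses bytes, which scrambles the multibyte sequences.
$reversed = strrev($word);
var_dump(mb_check_encoding($reversed, 'UTF-8')); // bool(false)
```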

And second, even if you have PHP under control, there can still be problems. The HTTP headers covering your communication also contain a character set identification and that will override what’s in the meta tag of your page.
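
One way to keep the HTTP header and the meta tag in agreement is to send the charset from PHP itself. This is only a sketch: contentTypeHeader() is a hypothetical helper, not part of PHP, and the scalar type declarations assume PHP 7.0 or later.

```php
<?php
// Hypothetical helper: builds the Content-Type header line for an HTML page.
function contentTypeHeader(string $charset = 'UTF-8'): string
{
    return 'Content-Type: text/html; charset=' . $charset;
}

// Headers have to go out before any page output, so guard the call.
if (!headers_sent()) {
    header(contentTypeHeader());
}
```

If your web server is configured to add its own charset to responses, that setting may still need to be checked separately.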

So, how does PHP deal with multibyte characters? There are two function groups that affect multibyte strings.

The first is iconv. With 5.0, this has become a default part of the language, a way to convert one character set into another character set representation. This is not what we are going to talk about in this article.

The second is multibyte support, a series of functions prefixed with “mb_”. There are a number of these functions and a quick review shows that some of them relate to determining if characters are appropriate based on the encoding scheme given, and others are search oriented functions, similar to PHP’s regular expression functions but built to work on multibyte strings.

Turning on Multibyte Support for PHP

Multibyte support is not a default feature of PHP, but neither does it require that we download any extra libraries or extensions; it just requires some reconfiguration. Unfortunately, if you’re using a hosted version of PHP, this might not be something you can do.

Take a look at your configuration using the phpinfo() function. Scroll about half-way down the output and there will be a section labeled “mbstring”. This will show you whether the basic functionality is enabled. For information on how to enable this, you can refer to the manual. In short, you enable the mb functions by using the --enable-mbstring compile time option. The run-time configuration option mbstring.encoding_translation is a separate switch that turns on transparent encoding conversion of incoming HTTP request data.
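
If you would rather not scroll through phpinfo() output, a quick check from code is possible too. This is a minimal sketch using extension_loaded() and mb_internal_encoding(); the messages are just placeholders.

```php
<?php
// Ask PHP directly whether the mbstring extension is available.
$hasMbstring = extension_loaded('mbstring');

if ($hasMbstring) {
    // Called with no argument, mb_internal_encoding() reports the current default.
    echo 'mbstring is enabled, internal encoding: ', mb_internal_encoding(), "\n";
} else {
    echo "mbstring is not available, so the mb_ functions cannot be used\n";
}
```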

The ultimate solution was supposed to be PHP 6, which was going to use the IBM (please, everyone remove their ball caps) ICU libraries to ensure native support for multibyte character sets. All we had to do was sit back and wait, eh buddy roe? That effort was eventually shelved, though, so check out the multibyte support that is available now.

Multibyte String Commands

It’s possible that there are 53 different multibyte string commands. It’s also possible that there are 54. I sort of lost count at one point, but you get the idea. Needless to say we’re not going to go through each one, but just for kicks let’s take a quick look at a few.

mb_check_encoding

The mb_check_encoding() function checks to determine if a specific encoding sequence is valid for an encoding scheme. The function does not tell you what the string is encoded as (or what schemes it will work for), but it does tell you if it will work or not for the specified scheme.

<?php
$string = '\u4F60\u597D\u4E16\u754C';
$string = json_decode('"' . $string . '"');
$valid = mb_check_encoding($string, 'UTF-8');
echo ($valid) ? 'valid' : 'invalid';

You can find a list of the supported encodings in the PHP manual.
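
To see the “works for this scheme or not” behavior with a failing case, here is a sketch that feeds mb_check_encoding() a string that is valid ISO-8859-1 but not valid UTF-8. The sample text is an assumption chosen for illustration.

```php
<?php
// "\xE9" is the single-byte ISO-8859-1 form of e-acute; alone it is not valid UTF-8.
$latin1 = "caf\xE9";

var_dump(mb_check_encoding($latin1, 'UTF-8'));      // bool(false)
var_dump(mb_check_encoding($latin1, 'ISO-8859-1')); // bool(true)

// After conversion, the same text passes the UTF-8 check.
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
var_dump(mb_check_encoding($utf8, 'UTF-8')); // bool(true)
echo bin2hex($utf8), "\n"; // 636166c3a9
```

Note that the function never reports which encoding the bytes were written in; it only answers yes or no for the encoding you name.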

mb_strlen

The strlen() function returns the number of bytes in a string. For ASCII where each character is a single byte, this works fine to find the number of characters. With multibyte strings you need to use the mb_strlen() function.

<?php
$string = '\u4F60\u597D\u4E16\u754C';
$string = json_decode('"' . $string . '"');

echo strlen($string); // outputs 12 – wrong!
echo mb_strlen($string, 'UTF-8'); // outputs 4

mb_ereg_search

The mb_ereg_search() function performs a multibyte version of the traditional character search. But there are a few caveats – you need to specify the encoding scheme using the mb_regex_encoding() function, the regular expression doesn’t have delimiters (it’s just the pattern part), and both the regex and string are specified using mb_ereg_search_init().

<?php
// specify the encoding scheme
mb_regex_encoding('UTF-8');

// specify haystack and search
$string = '\u4F60\u597D\u4E16\u754C';
$string = json_decode('"' . $string . '"');

$pattern = '\u754C';
$pattern = json_decode('"' . $pattern . '"');

mb_ereg_search_init($string, $pattern);

// finally we can perform the search 
$result = mb_ereg_search();
echo ($result) ? "found" : "not found";
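
For comparison, several readers in the comments suggest the preg_* functions with the u modifier instead. The sketch below repeats the same search that way. It assumes PHP 7.0 or later (for the "\u{...}" escapes) and a PCRE build with UTF-8 support, which cannot be taken for granted on every installation.

```php
<?php
// The same four-character haystack used in the examples above.
$string = "\u{4F60}\u{597D}\u{4E16}\u{754C}";

// The u modifier makes PCRE treat both pattern and subject as UTF-8.
$pattern = '/\x{754C}/u';

$found = preg_match($pattern, $string, $matches, PREG_OFFSET_CAPTURE);
echo ($found === 1) ? "found" : "not found";

// Offsets reported by preg_match() are measured in bytes, not characters.
$byteOffset = $matches[0][1];
```

Here the match is the fourth character, but the reported offset is 9 because each of the three characters before it occupies three bytes.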

Had Enough?

I don’t know about you but I think the world really needs more simple things. Unfortunately, multibyte processing is not going to fill that need. But for now it’s something you can’t ignore. There are times when you won’t be able to perform normal PHP string processing, because you are trying to do it over characters that fall outside the ASCII range (U+0000 – U+007F). And that means you have to use the mb_ oriented functions.
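
If you want a quick way to tell whether a string stays inside the ASCII range, something like the following sketch could work. isPlainAscii() is a hypothetical helper name, and the type declarations assume PHP 7.0 or later.

```php
<?php
// Hypothetical helper: true when every byte is in the 7-bit ASCII range.
function isPlainAscii(string $text): bool
{
    // Without the u modifier PCRE works byte by byte, so this looks for any byte above 0x7F.
    return preg_match('/[^\x00-\x7F]/', $text) === 0;
}

var_dump(isPlainAscii('plain text')); // bool(true)
var_dump(isPlainAscii("caf\u{00E9}")); // bool(false)
```

A false result is the signal that the byte-oriented string functions are no longer safe for that input.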

Want to know more? Seriously, you do? I honestly thought that would scare you away. I was not prepared for that. And my time is up. Your best bet? Check out the PHP manual. Oh, and try stuff. There’s no substitute for actual experience using something.

  • http://www.mancko.com Mancko

    Hi,
    Thanks a lot for this article. I did not know the mb_check_encoding function, so I’ll peruse the documentation to see if I missed other interesting functions.
    I would have added the preg_* functions with the u (for PCRE_UTF8) pattern modifier which enables us to harness the full regexp power with UTF-8 strings, even though it’s only a subset of the whole multibyte encoding schemes (the Dreadnought crew is very limited at the moment).

  • mc

    Any opinion on enabling mbstring.func_overload in php.ini? http://php.net/manual/en/mbstring.overload.php

    • http://dryga.com AndrewDryga

      You should not use it. For example, you wouldn’t be able to call the basic strlen (only mb_strlen!), and that can produce a large number of bugs.

      • David Shirey

        Very true. You are either in a multi-byte world or you are not. Good point.

  • peter

    “The ultimate solution, of course, is PHP 6 because it will use the IBM (please, everyone remove their ball caps) ICU libraries”

    I thought that the work on PHP 6 was purged due to the issues being faced to make multibyte standard in PHP?

    • David Shirey

      Good question and one that I ran into as soon as I had finished this article. I am trying to get some definitive information on that question and will relay it via a comment.

  • http://dryga.com AndrewDryga

    Actually I do not recommend using mb_check_encoding, because it can’t determine encoding at all. If you dive into the PHP source, you can find things like this:
    // ext/mbstring/libmbfl/mbfl/mbfl_ident.c:248
    int mbfl_filt_ident_true(int c, mbfl_identify_filter *filter)
    {
    return c;
    }
    // ext/mbstring/libmbfl/filters/mbfilter_cp1251.c:142
    /* all of this is so ugly now! */
    static int mbfl_filt_ident_cp1251(int c, mbfl_identify_filter *filter)
    {
    if (c >= 0x80 && c < 0xff)
    filter->flag = 0;
    else
    filter->flag = 1; /* not it */
    return c;
    }

    Original comments preserved. Encoding detection should be done manually. I have an article about it, but it’s in Russian. I can translate it, if you want.

    Also instead of mb_ereg* you should use preg_* functions with “u” modifier.

    • David Shirey

      First, thank you for your comment, Andrew. Multi-Byte String processing (mb_) is not without some controversy. And I am not going to presume to tell you whether or not you should use mb_ functions. I am not ready to agree with Andrew’s blanket statement that mb_check_encoding does not really check encoding at all. I will say that mb_ processing is fraught with problems and is not at all straightforward. And I don’t think it is the kind of thing that can be resolved in a single article, it being quite possible that one could write a long novella or short book on the subject before it is properly exhausted. I, for one, would like to see Andrew’s article, and I now deeply regret the fact that I bargained away my right to study Russian when I accepted a professor’s pity D near the end of Russian II in my sophomore year of college. There are many areas in any programming language that can excite heated debate among veteran programmers, and for PHP, mb support is certainly one. It is my hope that the comments will continue as a way to provide a broad view of practical experience. Anyone use mb_ functions who likes them?

  • Karen

    Having lived (and programmed) in Japan since before UTF-8 was even around (and English-speaking programmers didn’t even talk about ASCII – it was simply “text”), this is not at all a new topic to me. I can’t wait for the time when: (a) English-speaking webmasters learn that they must specify the character set of their pages (lest browsers with other preferred character sets display junk instead of pretty apostrophes); (b) Japanese cell phone makers quit making phones that expect Shift-JIS (there might even still be some that use JIS – I’m not sure) and support UTF-8; (c) PHP supports multibyte natively (your article indicates that PHP6 will do that when the time comes – yay!).

    Also, a warning: the mb_ functions are not always the panacea – in the case of functions using regex, it depends. Here is a forum thread of mine about a problem apparently caused by a bug in the POSIX regex engine in how it handles multibyte: http://tek-tips.com/viewthread.cfm?qid=1700955 In that case the solution was to NOT use the mb_ function, but the PCRE function with the /u switch, provided that PHP was compiled to use multibyte regex (which can’t be assumed for all installations of PHP today, but hopefully that will be true someday!).

    • David Shirey

      Thank you, Karen, these are some excellent comments. If I had the article to write again I would spend more time stressing exactly what you say – that mb is no panacea and that there are a lot of problems and gotchas surrounding it. Thank you for helping to highlight this.