2020-04-13

One problem with the mb_ functions of PHP is that they go all the way in replacing byte offsets with character offsets. Perhaps surprisingly, this is not what one wants in programming! It means that every time you apply an mb_ string function at a nonzero offset in a string, the function has to look at each character up to the given offset. This makes offset-based string operations O(n) rather than O(1) in their basic performance.
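A minimal sketch of the difference, assuming the mbstring extension is loaded and PHP 7 or later (for the `\u{...}` escapes); the sample string is made up for illustration. `strpos` and `substr` work in byte offsets, while `mb_strpos` and `mb_substr` work in character offsets, so in general the mb_ calls have to walk the string from its start to locate the offset.

```php
<?php
// Sample text "naïve café", written with escapes so the bytes are unambiguous.
$s      = "na\u{EF}ve caf\u{E9}";
$needle = "caf\u{E9}";

// strpos reports a byte offset: "ï" occupies two bytes, so the match starts at byte 7.
$bytePos = strpos($s, $needle);

// mb_strpos reports a character offset: six characters precede the match.
$charPos = mb_strpos($s, $needle, 0, 'UTF-8');

// A byte offset lets substr jump straight to the position.
$fromBytes = substr($s, $bytePos);

// A character offset has to be translated back into a byte position first,
// which in general means scanning the string from the beginning.
$fromChars = mb_substr($s, $charPos, null, 'UTF-8');
```

Both calls return the same substring; only the way the starting position is found differs.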

The reason they have to look at every character is that in utf-8 (our standard character encoding), only ASCII characters (unaccented English letters, digits, and common punctuation) have a byte length of 1. Every other code point takes 2 to 4 bytes. A character as the user perceives it can also be built from several code points, so in general the byte length of a character has no particular upper bound. The ideal library for manipulating utf-8 would therefore use byte offsets for speed of access to a character position during processing, but would look at entire characters and grapheme sequences when doing matching and other string operations. The mb_ string functions don’t work this way.
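As a rough illustration of working in byte offsets while still stepping over whole code points, the sketch below reads the length of each utf-8 sequence from its lead byte. The helper name `utf8SequenceLength` is hypothetical, and the code assumes valid utf-8 input, an offset that falls on a sequence boundary, and PHP 7 or later.

```php
<?php
// Hypothetical helper: byte length of the utf-8 sequence that starts at $offset.
// Assumes valid utf-8 and that $offset points at a lead byte, not a continuation byte.
function utf8SequenceLength(string $s, int $offset): int
{
    $lead = ord($s[$offset]);
    if ($lead < 0x80) {
        return 1; // 0xxxxxxx: ASCII
    }
    if ($lead < 0xE0) {
        return 2; // 110xxxxx
    }
    if ($lead < 0xF0) {
        return 3; // 1110xxxx
    }
    return 4;     // 11110xxx
}

// "a", "é", "€", and an emoji: sequences of 1, 2, 3, and 4 bytes.
$s = "a\u{E9}\u{20AC}\u{1F600}";

// Walk the string by byte offset, advancing one code point at a time.
$lengths = [];
for ($i = 0, $n = strlen($s); $i < $n; $i += $len) {
    $len = utf8SequenceLength($s, $i);
    $lengths[] = $len;
}
```

Each step costs the same regardless of how far into the string the offset is, which is the property the character-offset API gives up.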

Another problem (you may not be aware of this) is that Unicode is surprisingly complicated. It doesn’t just catalog a great number of characters, no. It includes a number of specialized mechanisms, such as combining marks that attach one or more accents to the preceding character, graphemes composed of many code points (basic Unicode characters), and variation selectors that choose between presentations of the same character (such as an emoji rendered in color versus monochrome). Unicode makes the idea of “character” a constructive one.
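To make the layers concrete, here is a small sketch that measures the same text at three levels: bytes, code points, and grapheme clusters. It assumes PHP 7 or later with PCRE's Unicode support (the `/u` modifier, and `\X`, which matches one extended grapheme cluster); the helper name `countLevels` is hypothetical.

```php
<?php
// Hypothetical helper: size of a utf-8 string at three levels.
function countLevels(string $s): array
{
    return [
        'bytes'      => strlen($s),                    // raw bytes
        'codepoints' => preg_match_all('/./us', $s),   // one match per code point
        'graphemes'  => preg_match_all('/\X/u', $s),   // one match per grapheme cluster
    ];
}

// "e" followed by U+0301 COMBINING ACUTE ACCENT: displayed as one accented letter.
$accented = countLevels("e\u{301}");

// U+2764 followed by U+FE0F (variation selector-16, requesting emoji presentation).
$heart = countLevels("\u{2764}\u{FE0F}");
```

In both cases the reader sees a single character, yet it spans two code points and several bytes, so none of the three counts can stand in for the others.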

For this reason, few applications support Unicode correctly, in my experience (I have yet to find a text editor that is correct both in rendering and in moving the caret, and I have evaluated 14 modern editors so far). There is an online tool for testing a string for correctness, although it doesn’t seem to have been validated and tested by the Unicode Consortium. Because Unicode characters are so richly defined algorithmically, they are difficult to parse, so libraries (including the mb_ functions) are likely to be buggy, and therefore unreliable in applications that need to accept and manipulate the whole range of Unicode.

In short, Unicode is complex and its proper support is currently very rare. Even PHP, with its impressive improvement from version to version, still does not offer correct and complete support either for Unicode or for its standard utf-8 encoding.