Why does PHP still not handle Unicode?

m029 · January 8, 2020, 5:11pm

PHP is arguably the simplest language for server-side programming (as compared with asp.net, C++, Java, JSP, Python, and others). Yet, like many older software tools, it provides only token support for Unicode (it accepts any bytes in a string, it has conversion functions for utf-8 that work with one-byte (ISO-8859-1) characters, it offers a limited set of MB_ string functions, etc.).

But if the programmer wishes to support input or output in human language, and naturally wants to manipulate strings using the PHP string functions, the result is failure. One would think that since the de facto Web standard encoding is utf-8, that PHP would be extended to support utf-8 strings natively. I can’t think of a technical reason why such support cannot be added to Zend and PHP. And I include in that belief the fact that there is no upper bound on the length in bytes of a single Unicode character (which may be outside the BMP or may be a grapheme containing many glyphs).

My primary question is: why has PHP not been extended? Why is there not a PHP directive to switch all the string functions (and any directly supporting code and libraries) so they work with utf-8? And a followup question: is there any known workaround (such as an additional library that is easy to add to Apache+PHP)?

The days when programmers dealt only with text in their own native language are over. It’s time to have some substantial eggs with our morning toast.

m_hutley · January 8, 2020, 6:37pm

https://www.php.net/manual/en/book.mbstring.php

m029 · January 8, 2020, 7:35pm

Thank you. I should have tried out the MB_ functions before assuming they wouldn’t work with utf-8 internal strings. Adding and removing the BOM is certainly not much of a problem.

ahundiak · January 8, 2020, 8:31pm

Ever wonder what happened to PHP 6? The powers that be spent years trying to basically redesign PHP from the ground up so that it could handle Unicode cleanly. They finally gave up and scrapped the entire version.

I might add that I find it strange that you think C++ somehow provides built in support for utf strings.

m029 · January 8, 2020, 9:07pm

I never said that. I said that C++ was a server-side programming language. I’m curious if you disagree?

m029 · January 8, 2020, 9:09pm

I wrote a small PHP program to read in my favorite UTF-8 test file (no BOM) and to do simple string manipulation using the MB_ string functions. To my pleasant surprise, it worked perfectly. I will try to delete this posting in 24 hours, since I’m obviously wrong!

spaceshiptrooper · January 8, 2020, 10:26pm

Wrong? No. Misinformed? Yes. People who come from another language often assume PHP is as bad as the majority of the chatter makes it out to be. It’s actually not. The majority of that chatter is just exaggerated or misinformed comments. I don’t think you should delete your comments or thread. It serves a purpose for others who may have the same concern or opinion about PHP.

rpkamp · January 11, 2020, 11:33pm

Meh, not even that. A lot of what’s being said about PHP was actually true at some point. Ever looked at PHP 3 or 4? They were horrible.

PHP got a lot better over time, but people still talk about how it used to be. Stubbornness on their part, or maybe the PHP community is not vocal enough about its latest changes? Probably a bit of both.

spaceshiptrooper · January 12, 2020, 12:05am

The ones I’ve heard aren’t even that. It’s pretty outrageous if you’ve heard it. One exaggerated complaint was about the PHP documentation and that it’s somehow “poorly” written. I’ve seen worse documentation, and PHP’s isn’t that bad. PHP’s documentation actually lets you know what each argument requires when you’re using a function. It tells you what kind of data type it’s looking for and the like. Some documentation doesn’t even do that.

When I heard that, I thought it was ridiculous, but a lot of people who use other languages agree with this ridiculous complaint. There are other absurdly exaggerated complaints as well that top this one, and everyone tries to tie it all into the “main” PHP syntax.

m029 · January 12, 2020, 12:28am

Well, this is all true, and I am using version 7 now.

But after a few days of looking at the rather hard-to-use mb_ string functions, I’ve reluctantly decided to write my own functions tailored to parsing, where the position (offset) is maintained as a byte count, not a character count. This way I can use a mixture of fast substr() for positioning and slow mb_ functions for actual searching, matching, substr(), etc.

I’d actually like to use preg_match, etc., with the “u” unicode flag, but I have no confidence that they will work, as I can’t find full documentation.

I want to use utf-8 ONLY, and never Unicode code points, for simplicity and because code points don’t always represent characters.

There is also an interesting utf-8 PHP library in GitHub (“portable utf-8”), but it seems to be poorly documented, so I don’t trust it.

Mittineague · January 12, 2020, 12:23pm

I am uncertain about what might be behind your “ONLY” and “never”. The “U” in “UTF-8” is short for “Unicode”.

It could be argued that it is best to use whatever tools a language provides for working with a slow process, under the presumption the contributing authors have done as best as possible within the confines of the language.

Good documentation is a very great thing. IMHO it sure would be nice if it were more common. But I don’t know that I would use it as a way to evaluate trustworthiness. Easier to learn and work with, most definitely. I think for trust, test suites are invaluable.

String functions may be faster than multibyte safe functions, though I imagine the speed difference would be relatively negligible. In any case, without the “safe” I think you should make sure you don’t introduce collation errors if they might be a problem.

m029 · January 12, 2020, 1:05pm

One problem with the mb_ functions of PHP is that they go all the way in replacing byte offsets with character offsets. Perhaps surprisingly, this is not what one wants for use in programming! It means that every time you want to apply any mb_ string function at a nonzero offset in a string, the mb_ function has to look at each character up to the given offset. This makes positioning within a string O(n) rather than O(1).

The reason they have to look at every character is that in utf-8 (our standard character encoding), only ASCII characters (unaccented English letters, digits, and common punctuation) have a byte length of 1. Every other code point takes 2 to 4 bytes, and a character as the user sees it (a grapheme built from several code points) can be longer still. In general, the byte length of such a character has no particular upper bound. The ideal library for manipulating utf-8 would therefore use byte offsets for speed of access to a character position during processing, but would look at entire characters and grapheme sequences when doing matching and other string operations. The mb_ string functions don’t work this way.
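To make the difference between byte offsets and character offsets concrete, here is a small sketch. It assumes PHP 7 or later with the mbstring extension loaded; the sample string is only an illustration.

```php
<?php
// Sketch only: byte offsets versus character offsets in a UTF-8 string.
// Assumes PHP 7+ with the mbstring extension loaded.

$text = "na\u{EF}ve caf\u{E9}";   // "naïve café", written with escapes

// strlen() counts bytes; mb_strlen() counts code points.
$byteLength = strlen($text);               // ï and é take 2 bytes each
$charLength = mb_strlen($text, 'UTF-8');

// strpos() returns a byte offset; mb_strpos() returns a character offset.
$needle  = "caf\u{E9}";
$bytePos = strpos($text, $needle);
$charPos = mb_strpos($text, $needle, 0, 'UTF-8');

// Each offset is only valid with its own family of functions.
$viaBytes = substr($text, $bytePos);
$viaChars = mb_substr($text, $charPos, null, 'UTF-8');

printf(
    "bytes=%d chars=%d bytePos=%d charPos=%d same=%s\n",
    $byteLength, $charLength, $bytePos, $charPos,
    $viaBytes === $viaChars ? 'yes' : 'no'
);
```

The two offsets differ as soon as a multi-byte character appears before the match, which is why mixing substr() with mb_ offsets (or the reverse) gives wrong results.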

Another problem (you may not be aware of this) is that Unicode is surprisingly complicated. It doesn’t just catalog a great number of vector patterns (glyphs), no. It includes a number of specialized operations, such as moving the rendering position around for combining one or more accent marks with a previous character, or constructing a grapheme composed of many code points (basic Unicode characters), or even for mapping characters from one Unicode code point to another (such as in selecting emoticons/emoji containing color versus being monochrome). Unicode makes the idea of “character” a constructive one.
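A short sketch of why “character” is ambiguous: the same string has three different lengths depending on whether you count bytes, code points, or graphemes. It assumes PHP 7+ with mbstring and a PCRE build with Unicode support; describeLength() is a hypothetical helper name, and \X is PCRE’s escape for one extended grapheme cluster.

```php
<?php
// Sketch only: three ways to measure the "length" of a UTF-8 string.
// describeLength() is a hypothetical helper, not part of PHP.

function describeLength(string $s): array
{
    return [
        'bytes'       => strlen($s),                 // raw bytes
        'code_points' => mb_strlen($s, 'UTF-8'),     // Unicode code points
        'graphemes'   => preg_match_all('/\X/u', $s) // user-perceived characters
    ];
}

// "e" followed by U+0301 COMBINING ACUTE ACCENT: renders as one accented letter.
$accented = describeLength("e\u{0301}");

// U+2764 HEAVY BLACK HEART followed by U+FE0F (emoji presentation selector).
$heart = describeLength("\u{2764}\u{FE0F}");

print_r($accented);
print_r($heart);
```

In both samples the grapheme count is 1 while the code point count is 2, so even mb_strlen() does not give the number of characters a reader would see.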

For this reason, few applications support Unicode correctly, in my experience (I have yet to find a text editor that is correct both in rendering and in moving the caret, and I have evaluated 14 modern editors so far). There is an online tool for testing a string for correctness, although it doesn’t seem to have been validated and tested by the Unicode organization. Because Unicode characters are so richly defined algorithmically, they are difficult to parse, so libraries (including the mb_ functions) are likely to be buggy, meaning unreliable in applications that need to accept and manipulate the whole range of Unicode.

In short, Unicode is complex and its proper support is currently very rare. Even PHP, with its impressive improvement from version to version, still does not offer correct and complete support either for Unicode or for its standard utf-8 encoding.

TheRedDevil · January 12, 2020, 10:46pm

I am not sure what you mean by hard-to-use mb_ functions. Slower, yes, but I can’t see how they are harder to use than the default string functions?

We have been using the mb_ functions for large-scale applications for the last 15 years, and in normal use you won’t even notice the difference in the application’s run time between using mb_ and not.

Though if you are doing a lot of file reading and manipulating the content, PHP is not the best language for that in the first place. If this is what you’re trying to do, you will see major speed gains by switching to a language that is better suited for it.

m029 · January 12, 2020, 11:22pm

The mb_ string functions seemed more difficult to use because they have 13 functions starting with “mb_ereg” as compared with 10 much more intuitively named single-byte functions starting with “preg_”. Also, several of the single-byte string functions have not been implemented in mb_ functions. Perhaps I’m wrong, if you’ve had no problem using them for 15 years. Perhaps you can point me to a tutorial for best practice in using mb_ functions?

As to performance, I’m writing a mostly string-processing server-side website framework. I’m worried that using mb_ functions exclusively will be very slow for the reason I gave above.

As to PHP not being the best language for string manipulation, I certainly agree. C, C++, and Go, although lower in language level, are much speedier. But if I’m writing server-side code, and I don’t want to learn Python or JSX (which I don’t if I have a choice, since my time is limited), and I want to produce standard code that will run on any server, and I prefer having short coding/debug cycles, PHP is still a good choice.

m029 · January 13, 2020, 3:24pm

Neither mb_strcut nor mb_substr is ideal for reading variable-length characters from a string. The ideal would be a substring function that takes a byte offset and returns a substring, a new byte offset, and a new character offset, in my opinion. It should assume that the input byte offset is the start of a character (this is easy to guarantee in use).
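Something along these lines can be sketched with preg_match(), whose offset argument is a byte offset. This is an illustration, not a tested library: readGrapheme() is a hypothetical name, it assumes PHP 7.1+ with a PCRE build that supports \X under the “u” modifier, and it returns only the substring and the next byte offset (the caller can keep a character count by counting calls). Whether it is actually faster than mb_substr() would need measuring.

```php
<?php
// Sketch only: read one grapheme starting at a byte offset.
// readGrapheme() is a hypothetical helper, not part of PHP.

function readGrapheme(string $s, int $byteOffset): ?array
{
    if ($byteOffset < 0 || $byteOffset >= strlen($s)) {
        return null; // nothing left to read
    }
    // \G anchors the match at the given byte offset; \X matches one
    // extended grapheme cluster. preg_match() returns false if the offset
    // falls inside a multi-byte sequence or the string is not valid UTF-8.
    if (preg_match('/\G\X/u', $s, $m, 0, $byteOffset) !== 1) {
        return null;
    }
    return [$m[0], $byteOffset + strlen($m[0])];
}

// Usage: walk a string grapheme by grapheme, tracking the byte offset.
$text   = "e\u{0301}a\u{1F600}"; // decomposed é, then "a", then an emoji
$offset = 0;
$parts  = [];
while (($result = readGrapheme($text, $offset)) !== null) {
    list($grapheme, $offset) = $result;
    $parts[] = $grapheme;
}
printf("%d graphemes, final byte offset %d\n", count($parts), $offset);
```

The loop never converts between byte and character positions, which is the point of the design: positioning stays in bytes while matching works on whole graphemes.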

TheRedDevil · January 13, 2020, 4:52pm

I am sorry, I am unaware of any tutorial for the use of the functionality.

We do not use the mb_ereg etc. functionality; instead we use preg_ with the multibyte/Unicode modifier (u). As long as you are careful and test the regex, this usually works great. Please note, though, that you cannot use the position, offset, etc. functionality without converting the position to multibyte first, since the underlying functionality is unaware of the modifier you passed with the regex and still works in bytes.

An offset converter like this can easily be created by writing a method that uses mb_strlen etc. to find the correct multibyte positions for the characters.
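A minimal sketch of such a helper, assuming PHP 7+ with mbstring and UTF-8 input. The function names byteToCharOffset() and charToByteOffset() are made up for this example.

```php
<?php
// Sketch only: convert between the byte offsets reported by preg_*
// and the character offsets expected by mb_* functions.
// Both helper names are hypothetical.

function byteToCharOffset(string $s, int $byteOffset): int
{
    // Count the code points in the bytes that precede the offset.
    return mb_strlen(substr($s, 0, $byteOffset), 'UTF-8');
}

function charToByteOffset(string $s, int $charOffset): int
{
    // Count the bytes in the characters that precede the offset.
    return strlen(mb_substr($s, 0, $charOffset, 'UTF-8'));
}

$text = "caf\u{E9} 123"; // "café 123"

// PREG_OFFSET_CAPTURE reports byte offsets, even with the "u" modifier.
preg_match('/\d+/u', $text, $m, PREG_OFFSET_CAPTURE);
$byteOffset = $m[0][1];
$charOffset = byteToCharOffset($text, $byteOffset);

// The converted offset can now be passed to mb_ functions.
$digits = mb_substr($text, $charOffset, null, 'UTF-8');

printf("byte offset %d, character offset %d, match \"%s\"\n",
    $byteOffset, $charOffset, $digits);
```

Note that each conversion scans the string up to the offset, so it is best done once per match rather than inside a tight loop.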

Multibyte support in PHP was added as an afterthought, which is why it is not seamless to use. With PHP 6 having been cancelled due to the challenges of providing real Unicode support, it can only be hoped that the refactoring and updating of the core code being done now will make it easier to add at a later stage.

system · April 13, 2020, 11:52pm

This topic was automatically closed 91 days after the last reply. New replies are no longer allowed.
