How long is a piece of string?

That’s a question that’s been bugging me a lot recently. Wondering if anyone’s got any ideas on how to solve this problem…

PHP’s serialize() function allows you to represent PHP data structures as a string, which can then be parsed and restored to data with unserialize().

Because the string is very easy to generate, it opens up the possibility of using it in other languages to exchange data with PHP, which is what I’ve been doing with Javascript here. Other implementations exist in Ruby, Perl, Flash Actionscript and even C# – I’ve put together a list of those I’ve found here.

In general this approach works nicely – no need to reinvent stuff on PHP’s side at least. But there’s one catch: how long is a string? As this bug shows, the answer matters.

Using Javascript as the example, if I have a string like “Főő” (“Foo” using the Hungarian ő character, U+0151 – see here) (note Sitepoint seems to have trouble displaying it, so if you see the entity &#337; instead, that’s the same character)

var s = "Főő"; alert(s.length);

This tells me the string length is 3 – Javascript (at least in Mozilla / IE) counts characters here, not bytes.

When serializing this string for PHP using Javascript, its length forms part of the encoding, which looks like:

s:3:"Főő";

Unfortunately, depending on the character set being used on the server where PHP is running, PHP won’t see the string as 3 characters. It will see a higher number, probably 5 for most people, because PHP regards a character as being 1 byte in length and ő takes 2 bytes in UTF-8. In other words, if I just send the string length Javascript sees, PHP’s unserialize() function will complain that the reported length of the string doesn’t match the actual length.
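To make the mismatch concrete, here is a minimal sketch that puts the character count next to the number of bytes the same string occupies as UTF-8. The helper name utf8ByteLength is made up for illustration, and the whole thing assumes the string ends up in front of PHP encoded as UTF-8.

```javascript
// Hypothetical helper (not part of any library): number of bytes the
// string would occupy when encoded as UTF-8.
function utf8ByteLength(s) {
  var bytes = 0;
  for (var i = 0; i < s.length; i++) {
    var code = s.charCodeAt(i);
    if (code < 0x80) {
      bytes += 1;                       // plain ASCII
    } else if (code < 0x800) {
      bytes += 2;                       // e.g. the Hungarian o, U+0151
    } else if (code >= 0xD800 && code <= 0xDBFF && i + 1 < s.length &&
               s.charCodeAt(i + 1) >= 0xDC00 && s.charCodeAt(i + 1) <= 0xDFFF) {
      bytes += 4;                       // surrogate pair = one 4-byte character
      i++;                              // skip the second half of the pair
    } else {
      bytes += 3;                       // rest of the BMP (a lone surrogate is
                                        // counted as 3 here, an assumption)
    }
  }
  return bytes;
}

var s = "F\u0151\u0151";                // "Foo" with the Hungarian o
console.log(s.length);                  // 3 (what Javascript reports)
console.log(utf8ByteLength(s));         // 5 (what PHP counts for UTF-8 data)
```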

There’s a good explanation of the general problem from Derick here (PDF). You can see for yourself by passing the same string to serialize() in a PHP script (make sure your editor is using something like a Unicode code page – see the global properties in SciTE);

The result will probably look something like;

s:5:"Főő";

So how to fix this? How do I get Javascript to report the same string length that PHP sees (the number of bytes in the string)?
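One possible approach, sketched here rather than offered as the answer: count UTF-8 bytes instead of characters when building the s:N:"..."; form. This only holds up if the data really arrives at PHP as UTF-8 and nothing converts it on the way, which is an assumption. The function names byteLengthAsUtf8 and serializeString are hypothetical.

```javascript
// Hypothetical helper: byte count of s as UTF-8.
// encodeURIComponent emits one %XX escape per UTF-8 byte for every
// character it escapes and leaves the rest as single-byte ASCII, so
// collapsing each escape to one character gives the byte count.
// (It throws a URIError if s contains a lone surrogate.)
function byteLengthAsUtf8(s) {
  return encodeURIComponent(s).replace(/%[0-9A-F]{2}/g, "x").length;
}

// Build the PHP serialized form of a string, using the byte count
// rather than s.length as the reported length.
function serializeString(s) {
  return 's:' + byteLengthAsUtf8(s) + ':"' + s + '";';
}

console.log(serializeString("Foo"));           // s:3:"Foo";
console.log(serializeString("F\u0151\u0151")); // s:5:"Főő";
```

If the server treats the incoming bytes as some other multi-byte encoding, the count will be off again, so this narrows the problem rather than removing it.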

So far I’ve been converting strings in Javascript to UTF-8 which, basically by planned coincidence, works if the server where PHP is running is using something like ISO-8859-1 (western Europe). The number of bytes a character takes in UTF-8 generally matches the number of bytes PHP ends up with when it treats the data as ISO-8859-1 (even if it looks strange). Unfortunately that doesn’t work on Sourceforge – locale(1) actually reports LC_CTYPE="en_US.UTF-8" (which confuses me further and may be missing the point).

I want to avoid doing character set conversions in PHP at all costs (for a start, iconv has only just become part of the default PHP distribution), and I’d rather not attempt to report the locale the OS is using to Javascript, as there’s no standard API for obtaining that information in PHP. Looking at what other people have done, it doesn’t look like anyone has thought about anything but US-ASCII (so no useful inspiration, unfortunately).

Any ideas?

Side note – although browsers automatically deal with form character encoding, it looks like XMLHttpRequest in both Mozilla and IE leaves it up to the developer when POSTing data, irrespective of the HTTP request headers you set (I haven’t 100% confirmed that though).
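If the encoding really is left to the developer, one way to take it out of the browser’s hands might be to percent-encode the POST body yourself, since encodeURIComponent always works in terms of UTF-8 bytes. A rough sketch; buildFormBody and the field name data are made up for illustration, and the XMLHttpRequest calls are left as comments because they only run in a browser.

```javascript
// Hypothetical helper: encode name/value pairs as a form-style POST body.
// Every non-ASCII character is written out as its UTF-8 bytes (%XX).
function buildFormBody(params) {
  var pairs = [];
  for (var name in params) {
    if (Object.prototype.hasOwnProperty.call(params, name)) {
      pairs.push(encodeURIComponent(name) + "=" +
                 encodeURIComponent(params[name]));
    }
  }
  return pairs.join("&");
}

var body = buildFormBody({ data: 's:5:"F\u0151\u0151";' });
console.log(body); // data=s%3A5%3A%22F%C5%91%C5%91%22%3B

// In a browser the body could then be sent along these lines
// (the URL is a placeholder):
//   xhr.open("POST", "/some/script.php", true);
//   xhr.setRequestHeader("Content-Type",
//       "application/x-www-form-urlencoded; charset=UTF-8");
//   xhr.send(body);
```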