How long is a piece of string?


That’s a question that’s been bugging me a lot recently. Wondering if anyone’s got any ideas on how to solve this problem…

PHP’s serialize() function allows you to represent PHP data structures as a string, which can then be parsed and restored to data with unserialize().

Because the string is very easy to generate, it opens up the possibility of using it in other languages to exchange data with PHP, which is what I’ve been doing with Javascript here. Other implementations exist in Ruby, Perl, Flash Actionscript and even C# – I have put together a list of those I’ve found here.

In general this approach works nicely – no need to reinvent stuff on PHP’s side at least. But there’s one problem: how long is a string? As this bug shows, it’s a real issue.

Using Javascript as the example, if I have a string like “Főő” (“Foo” using Hungarian o character – see here) (note Sitepoint have a problem it seems hence the entities showing up – you’ll need to look up the character)


var s = "Főő";
alert (s.length);

Will tell me the string length is 3 – Javascript (at least in Mozilla / IE) is smart when it comes to understanding what a character is.
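One caveat worth keeping in mind: `length` counts UTF-16 code units rather than characters as a reader sees them, so it only matches the character count for characters in the Basic Multilingual Plane. A small sketch of the difference follows; `countCodePoints` is a name made up for this example, not a built-in.

```javascript
// Hypothetical helper: count Unicode code points instead of UTF-16 code units.
function countCodePoints(s) {
  // Array.from iterates a string by code point, so a surrogate pair counts once.
  return Array.from(s).length;
}

var s = "Főő";
console.log(s.length, countCodePoints(s)); // 3 3

// U+1D11E (musical G clef) is stored as a surrogate pair in Javascript strings.
var clef = "\uD834\uDD1E";
console.log(clef.length, countCodePoints(clef)); // 2 1
```

Neither number is a byte count, which is what PHP cares about.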

When serializing this string for PHP, using Javascript, its length forms part of the encoding, looking like;


s:3:"Főő";

Unfortunately, depending on the character set being used on the server where PHP is running, PHP won’t see the string as 3 characters. It will see a higher number, probably 5 for most people, because PHP regards a character as being 1 byte in length. In other words, if I just send the string length Javascript sees, PHP’s unserialize() function will complain that the reported length of the string doesn’t match the actual length.
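To make the mismatch concrete, here is a rough sketch that works out how many bytes a Javascript string would occupy once encoded as UTF-8. It assumes the data reaches PHP as UTF-8, which is exactly the part that varies between servers, and `utf8ByteLength` is an illustrative name rather than anything built in.

```javascript
// Hypothetical helper: number of bytes the string takes when encoded as UTF-8.
function utf8ByteLength(s) {
  var bytes = 0;
  for (var ch of s) {              // for...of walks the string by code point
    var cp = ch.codePointAt(0);
    if (cp < 0x80) bytes += 1;         // US-ASCII
    else if (cp < 0x800) bytes += 2;   // e.g. ő (U+0151)
    else if (cp < 0x10000) bytes += 3; // rest of the Basic Multilingual Plane
    else bytes += 4;                   // characters outside the BMP
  }
  return bytes;
}

var s = "Főő";
console.log(s.length);          // 3, what Javascript reports
console.log(utf8ByteLength(s)); // 5, what a byte-counting strlen() would see for UTF-8 input
```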

There’s a good explanation of the general problem from Derick here (PDF). You can see for yourself by running the following (make sure your editor is using something like a Unicode code page – see the global properties in SciTE);


<?php
echo serialize('Főő');
?>

The result will probably look something like;


s:5:"Főő";

So how to fix this? How do I get Javascript to report a string length which will be the same as PHP sees it (the number of bytes in the string)?

So far I’ve been converting strings in Javascript to UTF-8 which, basically by planned coincidence, works if the server where PHP is running is using something like ISO-8859-1 (western Europe). The number of bytes for a character in UTF-8 generally matches the number of bytes it will be represented as in ISO-8859-1 (even if it looks strange). Unfortunately that doesn’t work on Sourceforge – locale(1) actually reports LC_CTYPE="en_US.UTF-8" (which confuses me further and may be missing the point).
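As an illustration of that kind of conversion (a sketch of one well-known trick, not necessarily the code in use here), a string can be re-expressed so that each character stands for exactly one UTF-8 byte, and the length of the result then used in the serialized form. The function names are invented for the example, and `unescape` is deprecated although still widely available.

```javascript
// Hypothetical helper: turn a string into a "byte string" of its UTF-8 encoding.
function toUtf8ByteString(s) {
  // encodeURIComponent percent-encodes the UTF-8 bytes of non-ASCII characters;
  // unescape then turns every %XX back into a single character in the range 0-255.
  return unescape(encodeURIComponent(s));
}

// Hypothetical helper: PHP-style serialized string with a byte-based length prefix.
function serializeString(s) {
  return 's:' + toUtf8ByteString(s).length + ':"' + s + '";';
}

console.log(serializeString("Főő")); // s:5:"Főő";
console.log(serializeString("Foo")); // s:3:"Foo";
```

The length prefix is only right if the bytes PHP finally receives are the UTF-8 ones, so this does not remove the dependency on how the data is transported.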

I want to avoid doing character set conversions in PHP at all costs (for a start iconv has only just become part of the default PHP distribution), and I also want to avoid trying to report the locale the OS is using to Javascript, as there’s no standard API for obtaining that information in PHP. Looking at what other people have done, it doesn’t look like any have thought about anything but US-ASCII (so no useful inspiration unfortunately).

Any ideas?

Side note – although browsers automatically deal with form character encoding, it looks like XMLHttpRequest in both Mozilla and IE leaves it up to the developer to deal with when POSTing data, irrespective of the HTTP request headers you set (haven’t 100% confirmed that though).
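If the encoding really is left to the developer, one option is to percent-encode the value by hand before POSTing it. `encodeURIComponent` always emits the UTF-8 bytes of each character, so the resulting body is plain ASCII whatever the page encoding. In this sketch the field name `data` and the helper name are placeholders, and the request itself is not shown.

```javascript
// Hypothetical helper: build an application/x-www-form-urlencoded body for one field.
function buildPostBody(fieldName, value) {
  return encodeURIComponent(fieldName) + "=" + encodeURIComponent(value);
}

var body = buildPostBody("data", 's:5:"Főő";');
console.log(body); // data=s%3A5%3A%22F%C5%91%C5%91%22%3B
```

The body would still need to be sent with a Content-Type of application/x-www-form-urlencoded for PHP to decode it into its request variables.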


  • KJ

    Not the strongest on Unicode stuff, but I think you have to loop through the string and get the ascii value for each byte and decide what to do then, i.e.:

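    <?php
    // Count the characters (not the bytes) in a UTF-8 encoded string.
    function utf8_len($s)
    {
        $len = strlen($s); // strlen() counts bytes
        $utf8_len = 0;
        for ($i = 0; $i < $len; $i++)
        {
            $utf8_len++;
            $c = ord($s[$i]); // value of the lead byte
            if ($c >= 240) // 4 byte character, skip next 3 bytes
            {
                $i += 3;
            }
            elseif ($c >= 224) // 3 byte character, skip next 2 bytes
            {
                $i += 2;
            }
            elseif ($c >= 192) // 2 byte character, skip next byte
            {
                $i++;
            }
        }
        return $utf8_len;
    }

    if (!empty($_REQUEST['string']))
    {
        echo utf8_len($_REQUEST['string']);
    }
    ?>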



  • Isotopp

    Do you know about WDDX? It was specifically built as a serialize() that is interchangeable between programming languages.

  • S

    Yeah, sounds like WDDX would solve your problem. It was invented by Ben Forta (Coldfusion evangelist), IIRC.

    S

  • http://www.phppatterns.com HarryF

    KJ – thanks for tip – will see what I can do with that.

    “Do you know about WDDX? It was specifically built as a serialize() that is interchangeable between programming languages.”

    Am aware of WDDX and may consider it – also pondering using something like XML-RPC’s format. Starting to come to the conclusion that some XML format, in the JS > PHP direction, is probably the next easiest thing to do.

  • http://blog.casey-sweat.us/ sweatje

    Sounds like you are starting to feel like you are “pushing on a string” ;)

  • Anonymous

    You don’t actually need ascii for every single byte. If you think about it, you can just

    <?php
    function utf8_len($s)
    {
        $len = strlen($s);
        $utf8_len = 0;
        for ($i = 0; $i < $len; $i++)
        {
            $utf8_len++;
            $c = ord($s[$i]);
            if ($c >= 192 && $c < 224) // 2 byte character, skip next byte
            {
                $i++;
            }
            elseif ($c >= 224 && $c < 240) // 3 byte character, skip next 2 bytes
            {
                $i += 2;
            }
        }
        return $utf8_len;
    }

    I think you know what to do from there ;)

  • Les Wentworth

    The answer to “How long is a piece of string” is this.
    In the Middle Ages, a farmer had to tie the sheaves of cut wheat (or similar crop) at harvest time in his field. He carried twine to do this. The average length of twine used was approx 13 and 1/2 inches.
    Hope you find this enlightening.