How long is a piece of string?

by Harry Fuecks

That’s a question that’s been bugging me a lot recently. Wondering if anyone’s got any ideas on how to solve this problem…

PHP’s serialize() function allows you to represent PHP data structures as a string, which can then be parsed and restored to data with unserialize().

Because the string is very easy to generate, it opens up the possibility of using it in other languages to exchange data with PHP, which is what I’ve been doing with Javascript here. Other implementations exist in Ruby, Perl, Flash ActionScript and even C# - I’ve put together a list of those I’ve found here.

In general this approach works nicely - no need to reinvent stuff on PHP’s side at least. But there’s one problem; how long is a string? As this bug shows, it’s a problem.

Using Javascript as the example, take a string like “Főő” (“Foo” using the Hungarian ő character - see here). Note that Sitepoint seems to have a problem with this character, hence the entities showing up, so you may need to look it up.

var s = "Főő"; alert (s.length);

This will tell me the string length is 3 - Javascript (at least in Mozilla / IE) is smart when it comes to understanding what a character is.

When serializing this string for PHP using Javascript, its length forms part of the encoding, looking like;

s:3:"Főő";

Unfortunately, depending on the character set being used on the server where PHP is running, PHP won’t see the string as 3 characters. It will see a higher number, probably 5 for most people, because PHP regards a character as being 1 byte in length. In other words, if I just send the string length Javascript sees, PHP’s unserialize() function will complain that the reported length of the string doesn’t match the actual length.

There’s a good explanation of the general problem from Derick here (PDF). You can see for yourself by passing the same string to serialize() in PHP and echoing the result (make sure your editor is using something like a Unicode code page - see the global properties in SciTE).

The result will probably look something like;

s:5:"Főő";
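The same mismatch can be seen from the Javascript side. The sketch below compares the length Javascript reports with the number of bytes the string occupies once encoded as UTF-8. The helper name utf8ByteLength is made up for illustration, and the code assumes a runtime that provides TextEncoder (current browsers and Node.js do).

```javascript
// Hypothetical helper: number of bytes the string takes up as UTF-8.
// Assumes TextEncoder is available; it always encodes to UTF-8.
function utf8ByteLength(str) {
  return new TextEncoder().encode(str).length;
}

// "Főő", written with escapes so the file's own encoding does not matter.
const s = "F\u0151\u0151";

console.log(s.length);          // 3 (what Javascript reports)
console.log(utf8ByteLength(s)); // 5 (what a byte-counting strlen would see)
```

For a plain ASCII string like "Foo" both numbers are 3, which is why the problem stays hidden until a non-ASCII character shows up.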

So how to fix this? How do I get Javascript to report a string length which will be the same as PHP sees it (the number of bytes in the string)?

So far I’ve been converting strings in Javascript to UTF-8 which, basically by planned coincidence, works if the server where PHP is running is using something like ISO-8859-1 (western Europe). The number of bytes for a character in UTF-8 generally matches the number of bytes it will be represented as in ISO-8859-1 (even if it looks strange). Unfortunately that doesn’t work on Sourceforge - locale(1) actually reports LC_CTYPE="en_US.UTF-8" (which confuses me further and may be missing the point).
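As a rough sketch of the UTF-8 approach, the length field can be computed from code points rather than taken from the string’s length property. Both function names below are hypothetical and not part of any library, and the sketch only holds if the bytes PHP ends up receiving really are the UTF-8 encoding of the string.

```javascript
// Hypothetical helper: count the bytes a string needs in UTF-8
// by looking at each code point's range.
function countUtf8Bytes(str) {
  let bytes = 0;
  for (const ch of str) {          // iterates by code point, not by UTF-16 unit
    const cp = ch.codePointAt(0);
    if (cp < 0x80) bytes += 1;         // ASCII
    else if (cp < 0x800) bytes += 2;   // e.g. U+0151 (ő)
    else if (cp < 0x10000) bytes += 3; // rest of the Basic Multilingual Plane
    else bytes += 4;                   // supplementary planes
  }
  return bytes;
}

// Hypothetical helper: build a PHP serialize()-style string token
// whose length field is the byte count instead of str.length.
function serializePhpString(str) {
  return 's:' + countUtf8Bytes(str) + ':"' + str + '";';
}

console.log(serializePhpString("F\u0151\u0151")); // s:5:"Főő";
console.log(serializePhpString("Foo"));           // s:3:"Foo";
```

This only moves the assumption around: the reported length is right as long as nothing between Javascript and PHP re-encodes the data.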

I want to avoid doing character set conversions in PHP at all costs (for a start, iconv has only just become part of the default PHP distribution), and I also want to avoid attempting to report the locale the OS is using to Javascript, as there’s no standard API for obtaining that information in PHP. Looking at what other people have done, it doesn’t look like anyone has thought about anything but US-ASCII (so no useful inspiration, unfortunately).

Any ideas?

Side note - although browsers automatically deal with form character encoding, it looks like XMLHttpRequest in both Mozilla and IE leaves it up to the developer to deal with when POSTing data, irrespective of the HTTP request headers you set (haven’t 100% confirmed that though).
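One way to take the guesswork out of the POST body is to percent-encode it by hand. encodeURIComponent always escapes the UTF-8 bytes of a character, so the bytes PHP decodes are predictable whatever the browser does. A minimal sketch: buildPostBody and the field name data are made up for illustration, and the charset header in the comments is a suggestion rather than something confirmed to be needed.

```javascript
// Hypothetical helper: build an application/x-www-form-urlencoded body.
// encodeURIComponent percent-encodes each character's UTF-8 bytes.
function buildPostBody(fields) {
  return Object.keys(fields)
    .map(function (name) {
      return encodeURIComponent(name) + '=' + encodeURIComponent(fields[name]);
    })
    .join('&');
}

const body = buildPostBody({ data: 's:5:"F\u0151\u0151";' });
console.log(body); // data=s%3A5%3A%22F%C5%91%C5%91%22%3B

// In a browser the body could then be sent along these lines
// (url is a placeholder for the PHP script's address):
//   xhr.open('POST', url);
//   xhr.setRequestHeader('Content-Type',
//       'application/x-www-form-urlencoded; charset=UTF-8');
//   xhr.send(body);
```

Each ő arrives as the two bytes C5 91, so the 5 in the length field matches what PHP counts after decoding.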

This post has 7 responses so far

  1. Not the strongest on Unicode stuff, but I think you have to loop through the string and get the ascii value for each byte and decide what to do then, i.e.:

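    <?php
    // Count characters (not bytes) in a UTF-8 encoded string.
    function utf8_len($s)
    {
        $len = strlen($s); // length in bytes
        $utf8_len = 0;
        for ($i = 0; $i < $len; $i++)
        {
            $utf8_len++;
            $byte = ord($s[$i]);
            if ($byte >= 192 && $byte < 224) // lead byte of a 2-byte character, skip next byte
            {
                $i++;
            }
            elseif ($byte >= 224 && $byte < 240) // lead byte of a 3-byte character, skip next 2 bytes
            {
                $i += 2;
            }
            elseif ($byte >= 240) // lead byte of a 4-byte character, skip next 3 bytes
            {
                $i += 3;
            }
        }
        return $utf8_len;
    }

    if (isset($_REQUEST['string']))
    {
        echo utf8_len($_REQUEST['string']);
    }
    ?>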

     
  2. Do you know about WDDX? It was specifically built as a serialize() that is interchangeable between programming languages.

     
  3. Yeah, sounds like WDDX would solve your problem. It was invented by Ben Forta (Coldfusion evangelist), IIRC.

    S

     
  4. KJ - thanks for tip - will see what I can do with that.

    Do you know about WDDX? It was specifically built as a serialize() that is interchangeable between programming languages.

    Am aware of WDDX and may consider it - also pondering using something like XML-RPC’s format. Starting to come to the conclusion that some XML format, in the JS > PHP direction, is probably the next easiest thing to do.

     
  5. Sounds like you are starting to feel like you are “pushing on a string” ;)

     
  6. You don’t actually need ascii for every single byte. If you think about it, you can just

    <?php
    function utf8_len($s)
    {
        $len = strlen($s);
        $utf8_len = 0;
        for ($i = 0; $i < $len; $i++)
        {
            $utf8_len++;
            $byte = ord($s[$i]);
            if ($byte >= 192 && $byte < 224) { $i++; }        // 2-byte character, skip next byte
            elseif ($byte >= 224 && $byte < 240) { $i += 2; } // 3-byte character, skip next 2 bytes
            elseif ($byte >= 240) { $i += 3; }                // 4-byte character, skip next 3 bytes
        }
        return $utf8_len;
    }

    I think you know what to do from there ;)

     
  7. The answer to “How long is a piece of string” is this.
    In the Middle Ages, a farmer had to tie the sheaves of cut wheat (or similar crop) at harvest time in his field. He carried twine to do this. The average length of twine used was approx 13 and 1/2 inches.
    Hope you find this enlightening.

     
