The power of String.prototype.split() … almost

If you feel you’re not getting enough respect as a web developer, here’s a nice pie [profanity warning - don't click if you're easily offended] to throw at people.

Actually, I think the “time spent wishing a slow painful death on Bill Gates” segment needs expanding, although Bill isn’t directly to blame. In fact it would be great if the IE team could be more forthcoming and put names to features, so we know exactly who to swear at: “Hi, I’m [insert name] and I’m the guy that put an undefined value at the end of your array every time you leave that trailing comma, resulting in bugs that will keep you amused for hours :)”.
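For anyone who hasn't been bitten by it, the trailing comma problem looks roughly like this. The array contents are made up for illustration, and the IE behaviour is described from memory of older JScript versions rather than re-tested:

```javascript
// Note the trailing comma after the last element.
var colours = ["red", "green", "blue",];

// An engine that follows the spec ignores the trailing comma: length is 3.
// Older IE (JScript) counted an extra undefined element, giving length 4.
console.log(colours.length);

// Looping to colours.length is where the extra undefined element would hurt.
for (var i = 0; i < colours.length; i++) {
  console.log(colours[i].toUpperCase()); // throws on an undefined entry
}
```

The dull but reliable defence is simply never to leave the trailing comma in.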

A little bitter at the moment after getting stung by this special while playing with a Javascript version of this. Despite all things AJAXy, writing cross browser code still feels like flying blind. Allow me a moment of complaining…

From the spec (p103 / 104);

If separator is a regular expression that contains capturing parentheses, then each time separator is
matched the results (including any undefined results) of the capturing parentheses are spliced into the
output array. [...]
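
To make the "including any undefined results" part concrete, here is a small sketch of what an engine that follows the spec does when only one of two capturing groups takes part in a match. The input string and variable name are assumptions chosen for illustration:

```javascript
// Two capturing groups; only one of them participates in any single match.
var parts = "a:b-c".split(/(:)|(-)/);

// The group that did not participate still contributes an undefined entry,
// so the array has seven elements:
// "a", ":", undefined, "b", undefined, "-", "c"
console.log(parts.length);
console.log(parts);
```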

In fact this behaviour is nothing special to Javascript.

For example Perl…


use Data::Dumper;
print Dumper(split(/(:)/, 'a:b:c'));

…output…

$VAR1 = 'a';
$VAR2 = ':';
$VAR3 = 'b';
$VAR4 = ':';
$VAR5 = 'c';

…and PHP…


print_r(preg_split('/(:)/', 'a:b:c', -1, PREG_SPLIT_DELIM_CAPTURE));

…output…

Array
(
    [0] => a
    [1] => :
    [2] => b
    [3] => :
    [4] => c
)

…and Python…


import re
print(re.compile('(:)').split('a:b:c'))

…output…

['a', ':', 'b', ':', 'c']

In Javascript this might have been as easy as…


alert( "a:b:c".split(/(:)/) );

…which in Firefox (with help from Firebug) gives me;

["a",":","b",":","c"]

Likewise Opera 9 does the right thing. But in IE (6)…

a,b,c

Where are my captured separators?!

As Simon put it;

Why is this a big deal? Because it suddenly makes writing simple parsers and tokenisers a whole heck of a lot easier.
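
As a rough illustration of that point, a capturing split gets you most of the way to a tokeniser for simple arithmetic. The `tokenise` function and the token shape are hypothetical names made up for this sketch, and it assumes an engine that follows the spec:

```javascript
// Hypothetical helper: turn "12+34*5" into a list of typed tokens.
function tokenise(expr) {
  // The capturing group keeps each operator in the output array.
  var pieces = expr.split(/([+\-*\/])/);
  var tokens = [];
  for (var i = 0; i < pieces.length; i++) {
    // With a single capturing group the array alternates:
    // even indexes are the text between operators, odd indexes the operators.
    tokens.push({ type: i % 2 ? "operator" : "number", value: pieces[i] });
  }
  return tokens;
}

var tokens = tokenise("12+34*5");
for (var j = 0; j < tokens.length; j++) {
  console.log(tokens[j].type + ": " + tokens[j].value);
}
```

Without the captured separators, the operators are thrown away and you are back to walking the string by hand.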

Actually blaming the IE Team is probably unfair – this seems to be a “feature” delivered by the JScript team and appears to have crept into JScript.NET as well, for example with a script like split.js containing;


import System.Windows.Forms;
MessageBox.Show("a:b:c".split(/(:)/));

I can compile it with the jsc compiler in DOS like D:\js> C:\WINDOWS\Microsoft.NET\Framework\v2.0.50727\jsc.exe /nologo split.js then run the output split.exe to get exactly the same result: a,b,c. Sigh.

Anyway – more on that lexer some other time (managed to work around this eventually). BTW, if you need something for serious parsing in Javascript (although Moz only) have a look at this compiler generator in Javascript.
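
One possible way to sidestep the problem is to stop relying on split() for the captures and walk the string with exec() instead. This is only a sketch under that assumption, not the workaround used in the lexer mentioned above; `splitKeepingSeparators` is a hypothetical helper name, and it has not been verified against JScript:

```javascript
// Hypothetical helper: split `str` on `pattern`, keeping each matched
// separator in the output, without depending on split()'s capture handling.
function splitKeepingSeparators(str, pattern) {
  // Copy the pattern with the global flag so exec() advances through str.
  var flags = "g" + (pattern.ignoreCase ? "i" : "") + (pattern.multiline ? "m" : "");
  var re = new RegExp(pattern.source, flags);
  var out = [];
  var last = 0;
  var m;
  while ((m = re.exec(str)) !== null) {
    if (m[0].length === 0) {
      // Skip empty matches so the loop always makes progress.
      re.lastIndex++;
      continue;
    }
    out.push(str.slice(last, m.index)); // text before the separator
    out.push(m[0]);                     // the separator itself
    last = m.index + m[0].length;
  }
  out.push(str.slice(last)); // whatever follows the final separator
  return out;
}

console.log(splitKeepingSeparators("a:b:c", /:/));
```

Note that this keeps the whole match rather than individual capture groups, which is enough for the a:b:c case but is not a drop-in replacement for the behaviour the spec describes.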


  • http://www.eastley.net daniel_eastley

    The Java String object handles split as follows:


    "a:b:c".split(":") = {"a","b","c"};

    So who’s in the wrong here? Personally, coming from a Java background, I’d expect it to be handled the way it is in IE, although a better solution would be to have an additional parameter to say whether the separator is included in the output. Then everyone would be happy: :-)


    "a:b:c".split(":",true) = {"a",":","b",":","c"};
    "a:b:c".split(":",false) = {"a","b","c"};

  • http://www.gandullia.com jgandu

    daniel_eastley, remember, split is based on a regular expression, not a string.
    I don’t have access to a Java compiler right now, but try this in Java:
    "a:b:c".split("(:)");

  • http://www.phppatterns.com HarryF

    It seems like Daniel is correct (I’m no Java regex master)…

    
    import java.util.regex.*;
    
    public class Split {
        public static void main(String args[]) throws Exception{
            Pattern splitter = Pattern.compile("(:)");
            String str = "a:b:c";
            String [] pieces = null;
            pieces = splitter.split(str);
            for (int i = 0 ; i < pieces.length ; i++) {
                System.out.println(pieces[i]);
            }
        }
    }
    

    Results in;

    
    a
    b
    c
    
    
    
  • Fenrir2

    irb(main):001:0> s = "a:b:c"
    => "a:b:c"
    irb(main):002:0> s.split /:/
    => ["a", "b", "c"]
    irb(main):003:0> s.split /(:)/
    => ["a", ":", "b", ":", "c"]

  • http://www.calcResult.co.uk omnicity

    Would I be right in assuming that you have shown us a simplified version of your original problem, since I cannot see a reason why you would want it to behave like this?

  • http://www.phppatterns.com HarryF

    “Would I be right in assuming that you have shown us a simplified version of your original problem, since I cannot see a reason why you would want it to behave like this?”

    That’s right. In general this is a useful feature for building simple parsers. In my case I’ve been playing with a port of SimpleTest’s stack-based lexer into Javascript, so this was something buried fairly deep in the code.

  • Steve

    Want to get hot and bothered about another IE split method issue? IE does not include empty values in the array it creates if a regex (rather than a string) is used as the delimiter.

    However, I must admit that I prefer JScript’s way of handling splits which contain capturing groups.

  • Steven Levithan

    This script fixes the cross-browser JavaScript split() inconsistencies & bugs.

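
Steve's point about empty values in the comments above can be checked with a couple of lines. The input string and variable names are assumptions for illustration, and the IE result is as reported in that comment rather than re-tested here:

```javascript
// An empty field sits between the two commas.
var withRegex = "a,,b".split(/,/);
var withString = "a,,b".split(",");

// An engine that follows the spec keeps the empty string in both cases,
// so both arrays have three elements: "a", "", "b".
// Per the comment above, IE reportedly drops the empty entry in the regex case.
console.log(withRegex.length);
console.log(withString.length);
```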