Tokenization using regular expression sub patterns

    Harry Fuecks

    A while back was writing some stuff on this blog about regular expressoins. While that remains unfinished, a mini regex example – nothing earth shattering but a useful technique if you hadn’t already seen it.

    Prompted by a real world example, one often-overlooked feature of most regular expressions engines is how subpatterns can useful to whip up tokenizers relatively easily. The problem? I needed to match the word any of the words “Canton”, “Region” or “Group” in a string and perform a follow up action depending on which matched.

    Dealing with four main languages in Switzerland ( German, French, Italian and English), it get’s a bit more interesting; “Canton” translates to “Kanton” in German and “Cantone” in Italian, while “Region” is “Regione” in Italian. and Group is “Gruppe”, “Groupe” and “Gruppo” in German, French and Italian respectively. Composing those into three straightforward regular expressions I have;

    • Canton: /cantone?|kanton/i
    • Region: /regione?/i
    • Group: /groupe?|grupp(?:o|e)/i

    Now on examining some input string, I could try testing each of those regexes individually against the string but that’s a) inefficient and b) likely to lead to lengthier code. Instead I make a single regular expression using sub patterns: /(cantone?|kanton)|(regione?)|(groupe?|grupp(?:o|e))/i …then figure out which sub pattern matched after a match is made.

    Note that technically this problem is not really one of tokenization but rather just classifying the input with a common name, but the technique can be fairly easily extended. In PHP the solution is courtesy of the third argument to preg_match(), for example;

    $inputs = array( 'Kanton Zuerich', 'Frauenfeld Regione', 'Fricktal Gruppe');
    foreach ( $inputs as $input ) {
        preg_match("/(cantone?|kanton)|(regione?)|(groupe?|grupp(?:o|e))/i", $input, $matches);

    I get output like;

        [0] => Kanton
        [1] => Kanton
        [0] => Regione
        [1] => 
        [2] => Regione
        [0] => Gruppe
        [1] => 
        [2] => 
        [3] => Gruppe

    Notice how the first element of this array in always what I matched while elements indexed 1+ correspond to the position of subpattern I matched against, from left to right in the pattern – this I can use to tell me what I actually matched e.g.;

    $inputs = array( 'Kanton Zuerich', 'Frauenfeld Regione', 'Fricktal Gruppe');
    $tokens = array('canton','region','group'); // the token names
    foreach ( $inputs as $input ) {
        if ( preg_match("/(cantone?|kanton)|(regione?)|(groupe?|grupp(?:o|e))/i", $input, $matches) ) {
            foreach ( array_keys( $matches) as $key) {
                if ( $key == 0 ) { continue; } // skip the first element
                // Look for the subpattern we matched...
                if ( $matches[$key] != "" ) {
                    printf("Input: '%s',  Token: '%s'n", $input, $tokens[$key-1]);

    Which gives me output like;

    Input: 'Kanton Zuerich',  Token: 'canton'
    Input: 'Frauenfeld Regione',  Token: 'region'
    Input: 'Fricktal Gruppe',  Token: 'group'

    …so I’m now able to classify the input to one of a set of known tokens and react accordingly. Most regex. apis provide something along this lines, for example here’s the same (and much cleaner) in Python, which is what I actually used on this problem;

    import re
    p = re.compile('(cantone?|kanton)|(regione?)|(groupe?|grupp(?:o|e))', re.I)
    inputs = ('Kanton Zuerich', 'Frauenfeld Regione', 'Fricktal Gruppe')
    tokens = ('canton','region','group')
    for input in inputs:
        m  =
        if not m: continue
        for group, token in zip(m.groups(), tokens):
            if group is not None:
                print "Input: '%s', Token: '%s'" % ( input, token )

    Could be reduced further using list comprehensions but don’t think it helps readability in this case.

    An alternative problem to give you a feel for how this technique can be applied. Let’s say you want to parse an HTML document and list a subset of the block level vs. the inline level tags it contains. You might do this with two sub-patterns e.g. (</?(?:div|h[1-6]{1}|p|ol|ul|pre).*?>)|(</?(?:span|code|em|strong|a).*?>) (note this regex as-is is geared to python’s idea of greediness – you’d need to change it for PHP) leading to something like this is python;

    p = re.compile('(</?(?:div|h[1-6]{1}|p|ol|ul|pre).*?>)|(</?(?:span|code|em|strong|a).*?>)')
    for match in p.finditer('foo <div> test <strong>bar</strong> test 1</div> bar'):
        print "[pos: %s] matched %s" % ( match.start(), str(match.groups()) )

    The call to match.groups() returns a tuple which tells you which sub pattern matched while match.start() tells you the character position in the document where the match was made, allowing you to pull substrings out of the document.