Tokenization using regular expression sub patterns

A while back was writing some stuff on this blog about regular expressoins. While that remains unfinished, a mini regex example – nothing earth shattering but a useful technique if you hadn’t already seen it.

Prompted by a real world example, one often-overlooked feature of most regular expressions engines is how subpatterns can useful to whip up tokenizers relatively easily. The problem? I needed to match the word any of the words “Canton”, “Region” or “Group” in a string and perform a follow up action depending on which matched.

Dealing with four main languages in Switzerland ( German, French, Italian and English), it get’s a bit more interesting; “Canton” translates to “Kanton” in German and “Cantone” in Italian, while “Region” is “Regione” in Italian. and Group is “Gruppe”, “Groupe” and “Gruppo” in German, French and Italian respectively. Composing those into three straightforward regular expressions I have;

  • Canton: /cantone?|kanton/i
  • Region: /regione?/i
  • Group: /groupe?|grupp(?:o|e)/i

Now on examining some input string, I could try testing each of those regexes individually against the string but that’s a) inefficient and b) likely to lead to lengthier code. Instead I make a single regular expression using sub patterns: /(cantone?|kanton)|(regione?)|(groupe?|grupp(?:o|e))/i …then figure out which sub pattern matched after a match is made.

Note that technically this problem is not really one of tokenization but rather just classifying the input with a common name, but the technique can be fairly easily extended. In PHP the solution is courtesy of the third argument to preg_match(), for example;


$inputs = array( 'Kanton Zuerich', 'Frauenfeld Regione', 'Fricktal Gruppe');

foreach ( $inputs as $input ) {
    preg_match("/(cantone?|kanton)|(regione?)|(groupe?|grupp(?:o|e))/i", $input, $matches);
    print_r($matches);
}

I get output like;


Array
(
    [0] => Kanton
    [1] => Kanton
)
Array
(
    [0] => Regione
    [1] => 
    [2] => Regione
)
Array
(
    [0] => Gruppe
    [1] => 
    [2] => 
    [3] => Gruppe
)

Notice how the first element of this array in always what I matched while elements indexed 1+ correspond to the position of subpattern I matched against, from left to right in the pattern – this I can use to tell me what I actually matched e.g.;


$inputs = array( 'Kanton Zuerich', 'Frauenfeld Regione', 'Fricktal Gruppe');
$tokens = array('canton','region','group'); // the token names

foreach ( $inputs as $input ) {
    
    if ( preg_match("/(cantone?|kanton)|(regione?)|(groupe?|grupp(?:o|e))/i", $input, $matches) ) {
        
        foreach ( array_keys( $matches) as $key) {
            if ( $key == 0 ) { continue; } // skip the first element
            
            // Look for the subpattern we matched...
            if ( $matches[$key] != "" ) {
                printf("Input: '%s',  Token: '%s'n", $input, $tokens[$key-1]);
            }
        }
    }
}

Which gives me output like;


Input: 'Kanton Zuerich',  Token: 'canton'
Input: 'Frauenfeld Regione',  Token: 'region'
Input: 'Fricktal Gruppe',  Token: 'group'

…so I’m now able to classify the input to one of a set of known tokens and react accordingly. Most regex. apis provide something along this lines, for example here’s the same (and much cleaner) in Python, which is what I actually used on this problem;


import re

p = re.compile('(cantone?|kanton)|(regione?)|(groupe?|grupp(?:o|e))', re.I)
inputs = ('Kanton Zuerich', 'Frauenfeld Regione', 'Fricktal Gruppe')
tokens = ('canton','region','group')

for input in inputs:
    m  = p.search(input)
    if not m: continue
    for group, token in zip(m.groups(), tokens):
        if group is not None:
            print "Input: '%s', Token: '%s'" % ( input, token )

Could be reduced further using list comprehensions but don’t think it helps readability in this case.

An alternative problem to give you a feel for how this technique can be applied. Let’s say you want to parse an HTML document and list a subset of the block level vs. the inline level tags it contains. You might do this with two sub-patterns e.g. (</?(?:div|h[1-6]{1}|p|ol|ul|pre).*?>)|(</?(?:span|code|em|strong|a).*?>) (note this regex as-is is geared to python’s idea of greediness – you’d need to change it for PHP) leading to something like this is python;


p = re.compile('(</?(?:div|h[1-6]{1}|p|ol|ul|pre).*?>)|(</?(?:span|code|em|strong|a).*?>)')

for match in p.finditer('foo <div> test <strong>bar</strong> test 1</div> bar'):
    print "[pos: %s] matched %s" % ( match.start(), str(match.groups()) )

The call to match.groups() returns a tuple which tells you which sub pattern matched while match.start() tells you the character position in the document where the match was made, allowing you to pull substrings out of the document.

Free book: Jump Start HTML5 Basics

Grab a free copy of one our latest ebooks! Packed with hints and tips on HTML5's most powerful new features.

  • Andrei Z

    Another good way to tokenize is with preg_split() and PREG_SPLIT_DELIM_CAPTURE option. See my Regex Clinic slides on http://gravitonic.com/talks

  • phpimpact

    Thanks to guys like you Andrei PHP is on the top. Thanks man!

  • Greg Beaver

    PHP_LexerGenerator takes this principle and combines it with the idea of states to implement a re2c-like lexer generator grammar for PHP. http://pear.php.net/PHP_LexerGenerator

  • http://simon.incutio.com/ Skunk

    Interesting fact (well, I think so at least): the Django template engine uses a single regular expression to tokenize the entire template. That’s the main reason it’s so fast compared to many of the other Python template languages out there. Take a look at this: http://code.djangoproject.com/browser/django/trunk/django/template/__init__.py#L70 (in particular line 90, where tag_re is declared).

  • Anonymous

    a single, long regular expression can often be more inefficient than small regular expressions due to backtracking.

  • HarryF

    a single, long regular expression can often be more inefficient than small regular expressions due to backtracking.

    Interesting point although that might well be outweighed having to make additional calls in PHP or Python, depending on what you’re doing.

  • Anonymous