The Joy of Regular Expressions [2]

So continuing the fun started here…

Part 2

Where we’ve been so far…
Hunting for .jp(e)gs
- Escaping Meta Characters
Search and Replace

Part 3 is here

Where we’ve been so far…

First a quick summary of what we covered in part one;

Expression delimiters e.g. /yes/ or %yes%
Pattern modifier syntax e.g. /yes/i
Meta characters…
- Start and end assertions: ^ and $ e.g. /^yes$/
- Length Quantifiers which apply to the preceding character in the pattern:
  - The “one or more” quantifier: +
  - The “min / max” quantifying curly brackets: e.g. {5,20}
- Introduced character classes e.g. [a-zA-Z0-9]

You also encountered the preg_match() and preg_match_all() functions.

Time for some more syntax, by way of example…

Hunting for .jp(e)gs

Some applications save JPEGs with a file extension .jpeg while everyone else uses .jpg. Now if I’ve got a directory which I know contains some JPEGs, which could be named using either file extension, how do I identify them? And how do I filter out all the other file types in the directory at the same time?


<?php
$dh = opendir('/home/harryf/gallery');

while ( ($file = readdir($dh)) !== FALSE ) {

    if ( preg_match('/^.*\.jpe?g$/i', $file ) ) {
        print "$file\n";
    }
    
}

closedir($dh);

Zooming in on that pattern – /^.*\.jpe?g$/i what have I got? OK the ^ and $ meta characters you’ve seen before and know they match the start and end of the line. Also the /i pattern modifier you know means “case insensitive” – filenames could be upper or lower case. What else do I have here?

The ? is another meta-character: another length quantifier, similar to + and the curly brackets you’ve already seen. It means “zero or exactly one of the preceding character”. So this part of the example: jpe?g means;

I’m looking for a sequence of characters, starting with the letter ‘j’, then ‘p’, then optionally the letter ‘e’ and finally the letter ‘g’
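A quick way to convince yourself of what ? does is to try the jpe?g fragment on its own. This is only a minimal sketch; the sample strings are made up for illustration:

```php
<?php
// Hypothetical sample extensions, purely for illustration.
$samples = array('jpg', 'jpeg', 'jpeeg', 'jpg2');

$results = array();
foreach ($samples as $sample) {
    // ^ and $ anchor the pattern so the whole string has to match.
    $results[$sample] = preg_match('/^jpe?g$/', $sample);
    print $sample . ' => ' . $results[$sample] . "\n";
}
```

Only ‘jpg’ and ‘jpeg’ should come back with 1: two ‘e’s is one too many for ?, and the anchors rule out anything trailing after the ‘g’.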

But that’s not the only length quantifier I’ve introduced in this pattern. At the start I also have the * quantifier:

/^.*\.jpe?g$/i

The * quantifier means “zero or more of the preceding character” – no maximum limit, no minimum limit.

OK – but what’s the * quantifier being applied to – well the preceding character in the pattern is a period: . which is also a meta-character but more like the character classes you saw in part 1. It means “any character” – it will match anything (there is an exception to that which I’ll come to later). So, combined with the “zero or more” quantifier * the start of the pattern is saying…

I don’t care what the beginning of the filename is – anything is allowed of any length¹ (I’m only interested in the file extension)
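To see the “zero or more” part in action, here is a small sketch comparing .* with .+ (the + quantifier from part one). The test strings are invented, and I’ve left the file extension period out of the pattern to keep the focus on the quantifiers:

```php
<?php
// .* allows zero or more of any character before "jpg".
$starEmpty = preg_match('/^.*jpg$/', 'jpg');       // nothing before "jpg"
$starSome  = preg_match('/^.*jpg$/', 'photo-jpg'); // something before "jpg"

// .+ demands at least one character before "jpg".
$plusEmpty = preg_match('/^.+jpg$/', 'jpg');
$plusSome  = preg_match('/^.+jpg$/', 'photo-jpg');

var_dump($starEmpty, $starSome, $plusEmpty, $plusSome);
```

The only 0 should be $plusEmpty: with nothing in front of “jpg”, the + quantifier has nothing to match.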

Which leaves me only needing to explain what the \. in the middle of /^.*\.jpe?g$/i means…

Escaping Meta-Characters

Well it’s referring to the literal filename separator period e.g. “mypicture.jpg”. The period is normally a meta-character in regular expressions, but as I need it to match the filename separator, I have to place a backslash in front of it to escape it: \. Placing a backslash in front of a meta-character tells the regex engine to treat it as a literal character.
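The difference the backslash makes is easy to demonstrate. A minimal sketch, using a made-up filename that contains no period at all:

```php
<?php
$escaped   = '/^.*\.jpe?g$/i'; // \. matches only a literal period
$unescaped = '/^.*.jpe?g$/i';  // a bare . matches any single character

// "notajpg" is a hypothetical filename with no extension separator.
$withEscape    = preg_match($escaped, 'notajpg');
$withoutEscape = preg_match($unescaped, 'notajpg');

// A genuine JPEG filename still matches the escaped pattern.
$realFile = preg_match($escaped, 'MyPicture.JPEG');

var_dump($withEscape, $withoutEscape, $realFile);
```

Without the escape, the “a” in “notajpg” is happily accepted as “any character” and the file is wrongly identified as a JPEG.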

There’s also the preg_quote() function, intended for escaping stuff like user input, to be embedded in a pattern – more on that in a moment.

There’s a little more detail regarding which characters need escaping. One example: inside a regex character class it’s not necessary (although it doesn’t hurt if you do) to escape every meta-character: some meta-characters, like ‘+’ and ‘*’, automatically assume their literal meaning. Meanwhile other characters need to be escaped in addition to the normal set of meta-characters, if they are intended to have literal meaning, such as ‘-’ which would normally specify a range in a character class. As you start to memorize the syntax, it will become obvious when and where you need to escape characters – don’t worry too much right now.
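Here is a short sketch of those character class rules; the patterns are just illustrations, not anything you’d use in anger:

```php
<?php
// Inside a character class, + and * are literal without escaping.
$plusOrStar = preg_match('/^[+*]$/', '+');

// An escaped hyphen is a literal hyphen, not a range...
$literalHyphen = preg_match('/^[a\-z]$/', '-');
// ...so this class only contains "a", "-" and "z".
$notInClass = preg_match('/^[a\-z]$/', 'm');

// Unescaped, the hyphen builds the range a to z.
$inRange = preg_match('/^[a-z]$/', 'm');

var_dump($plusOrStar, $literalHyphen, $notInClass, $inRange);
```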

You should be aware though that when it comes to excessive escaping, life can get fun, because PHP’s strings also use backslashes for escaping certain characters e.g.;


print 'Tuesday\'s Child'; # Just a normal string

And more fun if you use double quotes. In an ideal world we’d have literal regular expressions as a PHP feature, like Perl and Javascript. But anyway… most of the time this won’t bother you; only when it does, it may drive you mad.
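The classic case of the two escaping layers colliding is trying to match a literal backslash. A minimal sketch (the Windows-style path is just an invented subject string):

```php
<?php
// In a single-quoted PHP string, \\ collapses to one backslash, so four
// backslashes in the source reach the regex engine as two, which the
// engine in turn reads as "one literal backslash".
$pattern = '/\\\\/';
$subject = 'C:\\temp';               // the string C:\temp

$patternLength = strlen($pattern);   // 4 characters: /\\/
$found = preg_match($pattern, $subject);

// Double quotes add escapes of their own: "\n" is a single newline,
// while '\n' is a backslash followed by the letter n.
$double = strlen("\n");
$single = strlen('\n');

var_dump($patternLength, $found, $double, $single);
```

Sticking to single quotes for patterns, unless you actually need to interpolate a variable, keeps the number of layers you have to think about down.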

Search and Replace

So far we’ve only been matching. What about some replacing?

A fairly popular feature to add to a site, although a little “non-vogue” since AJAX, is a “highlighter” for visitors that were referred to your site by a search engine. You identify the search term they used by looking at the HTTP referrer and highlight the corresponding words in your HTML, using something like a span tag.

In fact doing this in PHP is probably not the smartest idea – far better to use Javascript and save some server CPU cycles, but it does make a good example to illustrate regex search and replace, plus it highlights some potential security gotchas.

So kicking off, a naive implementation. I won’t attempt to reproduce the HTTP referrer but rather keep it simple, using a URL query which will be placed in the variable $_GET['q']…

Important Note: – this example is not secure (intentionally) – take those fingers off CTRL+C!


<?php
$text = 'The quick brown fox jumps over the lazy dog';

# Do we have a search term?
if ( isset($_GET['q']) ) {
    # Escape the input - make sure it won't contain
    # any regex meta-characters
    $q = preg_quote($_GET['q'], '/');
    
    # Replace any instances of the search term with the
    # same but nested in a span tag...
    $text = preg_replace(
            "/\b($q)\b/i",                    # Pattern
            '<span class="hilite">$1</span>', # Replacement
            $text                             # Subject
        );
    
}
?>
<html><head><title>Hilite</title>
<style type="text/css">.hilite { background-color: yellow }</style>
</head>
<body>
<?php print $text; ?>
</body>
</html>

OK – let me explain first what the code is doing then go on to explain why it’s not safe. Zooming in on the interesting part…



    # Escape the input - make sure it won't contain
    # any regex meta-characters
    $q = preg_quote($_GET['q'], '/');
    
    # Replace any instances of the search term with the
    # same but nested in a span tag...
    $text = preg_replace(
            "/\b($q)\b/i",                    # Pattern
            '<span class="hilite">$1</span>', # Replacement
            $text                             # Subject
        );

preg_quote()

The first thing I’m doing here is quoting the incoming query parameter so that if it contains anything that looks like a regex meta character, or any other regex syntax, it will be escaped by a backslash (if you insert a print statement to examine $q, you’ll be able to figure out what’s happening).

Now preg_quote() puts a backslash in front of any of the following characters…

. \ + * ? [ ^ ] $ ( ) { } = ! < > | :

That basically nails anything that could be mistaken for regex syntax… except for the expression delimiter. Which is what the second argument to preg_quote() is doing here…

$q = preg_quote($_GET['q'], '/');

The second argument tells preg_quote() which expression delimiter you are using, and so escapes it as well.
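To see the effect of that second argument, here is a small sketch using an invented search term that contains both regex syntax and the delimiter:

```php
<?php
// A made-up search term containing meta-characters and a slash.
$input = 'a.b/c+d?';

$withoutDelimiter = preg_quote($input);      // the / is left alone
$withDelimiter    = preg_quote($input, '/'); // the / is escaped too

print $withoutDelimiter . "\n"; // a\.b/c\+d\?
print $withDelimiter . "\n";    // a\.b\/c\+d\?
```

Embed the first version between / delimiters and the pattern would end early at the unescaped slash.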

A dose of fear and loathing: if you fail to escape user input and then embed it in a regex, you’ve opened the door to command injection – your users will be able to tell your regex engine what to do. At best this will just result in error messages (which you’re hopefully keeping quiet about) while the worst case scenarios could get very ugly, depending on what you’re doing – don’t forget.
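A harmless sketch of what “telling your regex engine what to do” looks like: the search term here is hypothetical, and all it does is turn a word search into “match everything”:

```php
<?php
$text = 'The quick brown fox';
$q    = '.*'; // hypothetical hostile (or just unlucky) search term

// Unescaped: the input is interpreted as regex syntax and swallows
// the entire text.
$raw = preg_match("/^($q)$/", $text, $rawMatch);

// Escaped: the input is treated as the two literal characters . and *
// which do not appear in the text.
$safe = preg_match('/^(' . preg_quote($q, '/') . ')$/', $text);

var_dump($raw, $rawMatch[1], $safe);
```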

preg_replace()

So what’s the next part of this script doing?


    $text = preg_replace(
            "/\b($q)\b/i",
            '<span class="hilite">$1</span>',
            $text
        );

It’s using the preg_replace() function to wrap all matches of the input search term with a span tag. You’re probably happy with str_replace() right? Well preg_replace() is essentially the same thing, but instead of just plain string substitution, it’s packed with regex goodness.

Now the pattern needs some explaining…

"/\b($q)\b/i"

The /i pattern modifier at the end you recognise, meaning “case insensitive” – this allows me to highlight more “hits” for the incoming search term.

Word Boundaries, Word Characters… and everything else

What about the \b that appears twice? It’s a meta-character meaning “assert a word boundary”. It’s something like the ^ and $ meta characters you’ve seen before but while they assert the start and end of a line, the \b meta-character asserts the “edge of a word” e.g. the point where there’s white space, punctuation etc. next to a sequence of word characters. Here’s how the PHP manual defines a word boundary…

A word boundary is a position in the subject string where the current character and the previous character do not both match \w or \W (i.e. one matches \w and the other matches \W), or the start or end of the string if the first or last character matches \w, respectively.

…Alles klar? The manual is defining word boundaries in terms of two other meta characters, which we haven’t looked at yet: \w – “word character” and \W. Don’t panic – there’s nothing really new here. Both of these are effectively shorthand for the regex character classes you’ve seen before, that save you having to define your own. Here’s the manual definition for \w;

A “word” character is any letter or digit or the underscore character

…and by extension, \W is everything else – everything that’s not a word character (such as punctuation, linefeeds and space characters).
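A small sketch of the two shorthands at work, using an invented subject string (plain ASCII, so no locale surprises):

```php
<?php
$subject = 'big_dog 42!';

// Strip every non-word character (\W) to see what \w would keep.
$wordCharsOnly = preg_replace('/\W/', '', $subject);

// Count the word characters: letters, digits and the underscore.
$count = preg_match_all('/\w/', $subject, $matches);

var_dump($wordCharsOnly, $count);
```

The space and the exclamation mark are the only \W characters here, leaving “big_dog42” and a count of 9.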

Detail overload!: now \w is actually not necessarily the same as the character class [a-zA-Z0-9_] – the part of the manual definition I omitted;

The definition of letters and digits is controlled by PCRE’s character tables, and may vary if locale-specific matching is taking place. For example, in the “fr” (French) locale, some character codes greater than 128 are used for accented letters, and these are matched by \w.

…in other words \w might let those cheeky foreigners sneak non-ASCII characters past your validation patterns! There’s a long story here that I’ll skip but you might get a further hint of insight here – see the note on locales and be aware this applies to \b by extension.

So where were we? Trying to figure out what a \b “word boundary” means: basically the start or end of any sequence of letters or numbers. In other words it’s a useful tool to help spot words. So let’s look at the complete pattern again…
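Before that, a quick sketch of what \b buys you. The sentence is made up, and chosen so that “fox” turns up both as a word in its own right and buried inside longer words:

```php
<?php
$text = 'The fox outfoxed the foxes. Fox!';

// Without \b, "fox" is also found inside longer words.
$anywhere = preg_match_all('/fox/i', $text, $loose);

// With \b on either side, only the stand-alone word counts.
$wholeWord = preg_match_all('/\bfox\b/i', $text, $strict);

var_dump($anywhere, $wholeWord, $strict[0]);
```

The loose pattern finds four hits, while the \b version finds just “fox” and “Fox”.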

Sub patterns

"/\b($q)\b/i"

Given the two word boundaries, we seem to be matching words here then. But what’s this in the middle: ($q) ? Well the $q is just a normal PHP variable – the user’s search term – which becomes part of the pattern thanks to my having used double quotes. So what are the parentheses doing around it? They are – you guessed it – also regex meta-characters: the delimiters for either side of a sub pattern ².

Sub patterns are a way of grouping parts of a pattern. They have a number of uses (more of which I’ll explore some other time) but for the purposes of our search and replace operation, they’re a way to mark part of the pattern as “especially interesting” – we don’t want that white space either side of the search term the user provided – we just want to capture the word itself, and replace it. To see how this works, you need to look at the next argument to preg_replace();


    $text = preg_replace(
            "/\b($q)\b/i",
            '<span class="hilite">$1</span>',
            $text
        );

…the '<span class="hilite">$1</span>'. This is what we want to replace the search term with. You can see the span tag there, waiting expectantly, but how does the matched word get embedded in there? Via the $1 – a backreference to the sub pattern. You’ll have to read the documentation for a full description of replacement backreference syntax but, in short, they’re “pointers” to stuff you matched with your pattern. The backreference $0 will always be available, and corresponds to the complete match. Further backreferences like $1 will exist if you used sub patterns (the more sub patterns, the more numbered backreferences will be available).
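A short sketch of the numbered backreferences, using bold tags and hard-coded words rather than the search term to keep things brief:

```php
<?php
$text = 'The quick brown fox jumps over the lazy dog';

// $1 refers to the first sub pattern, $0 to the complete match.
$viaGroup = preg_replace('/\b(fox)\b/i', '<b>$1</b>', $text);
$viaWhole = preg_replace('/\bfox\b/i', '<b>$0</b>', $text);

// Two sub patterns give two numbered backreferences to shuffle around.
$swapped = preg_replace('/(quick) (brown)/', '$2 $1', $text);

print $viaGroup . "\n";
print $viaWhole . "\n";
print $swapped . "\n";
```

Because \b is an assertion and adds nothing to the match, $0 and $1 produce the same output in the first two calls. The replacement strings are single quoted so PHP doesn’t try to interpolate $1 as a variable.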

So summing all this up, if I search for the word “fox”, preg_replace() will find it in $text here: brown fox jumps and replace it with brown <span class="hilite">fox</span> jumps.

Spot the XSS Hole

There’s a small problem though, which you may have already spotted. While I’ve wisely used preg_quote() to escape any input that looks like regex syntax, what if the search term contains HTML / Javascript? Right now it’s potentially open to cross site scripting. Rather than just replace, I need to run the sub pattern match through htmlspecialchars() first. But how?

A little scanning of the pattern modifiers and you’ll find the /e modifier;

If this modifier is set, preg_replace() does normal substitution of backreferences in the replacement string, evaluates it as PHP code, and uses the result for replacing the search string. Single and double quotes are escaped by backslashes in substituted backreferences.

Alright – problem solved…

Warning: this is also potentially insecure and with more serious consequences!


    $text = preg_replace(
            "/\b($q)\b/ie",
            '"<span class=\"hilite\">".htmlspecialchars("$1")."</span>"',
            $text
        );

eval() is evil!

With the /e modifier attached to the pattern, the replacement string stops being a string and becomes PHP code, ripe for eval(). The moment you hear eval() the word “evil” should be on the tip of your tongue. Here’s a tale of what can happen.

The simple rule is don’t use the /e modifier (PHP itself eventually enforced that rule: /e was deprecated in PHP 5.5 and removed in PHP 7.0). Aside from the security implications, the replacement code becomes something that has to be parsed, interpreted and executed on every substitution (if you want real fear and loathing, say “poor performance”). And hey – writing code for PHP’s eval is a mind warping exercise in quotes.

So what’s the alternative?

preg_replace_callback()

Instead of using preg_replace(), use preg_replace_callback(), avoiding evals and offering significantly better performance. Rather than providing it a string of code, you give it the name of a function to execute. Each time it needs to make a replacement, the function will be called with the match. So here’s the final solution;


    function highlight_search($matches) {
        return sprintf(
                '<span class="hilite">%s</span>',
                htmlspecialchars($matches[1])
            );
    }
    
    $text = preg_replace_callback(
            "/\b($q)\b/i",
            'highlight_search',
            $text
        );

The second argument to preg_replace_callback() is the name of my highlight_search() callback function, which will get called every time there’s a match to be replaced.

The callback function needs to handle a single parameter: $matches, which will always be an indexed array. And in much the same way as the replacement backreferences preg_replace() provides, the first element in the array will always exist and correspond to the complete match – that is $matches[0]. Sub patterns are assigned higher indexes in the array – in this case my sub pattern is at $matches[1] so all I need to do is run it through htmlspecialchars(), embed it inside the span tag with help from sprintf(), return it and I’m done. The returned value gets used as the replacement.
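To watch the callback earn its keep, here is a self-contained sketch. The text and the search term “R&D” are invented and hard-coded in place of $_GET['q'], chosen so that htmlspecialchars() has something to escape:

```php
<?php
function highlight_search($matches) {
    // $matches[0] is the complete match, $matches[1] the sub pattern.
    return sprintf(
            '<span class="hilite">%s</span>',
            htmlspecialchars($matches[1])
        );
}

$text = 'Our R&D team';            // hypothetical page text
$q    = preg_quote('R&D', '/');    // stand-in for $_GET['q']

$result = preg_replace_callback(
        "/\b($q)\b/i",
        'highlight_search',
        $text
    );

print $result . "\n";
```

The matched text comes back as R&amp;D inside the span. Note that only the matched part passes through the callback; the rest of $text is left exactly as it was.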

Be aware that the approach I’ve used of simply placing the incoming search term directly in the pattern (after escaping) is probably less than ideal from a usability perspective. Some thought needs to be given to stuff like punctuation in the search term, and it might be smarter to break the search term up into pieces and analyse it a little first. So this example is a sub-optimal solution, but I think manages to illustrate regex search and replace reasonably well.
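One possible way to break the search term into pieces is sketched below. Treat it as an assumption-laden illustration rather than a recommendation: the function names highlight_word() and highlight_terms() are made up, the query is hard-coded, and it leans on two things not covered so far, preg_split() and the | “or” meta-character for listing alternatives inside a sub pattern.

```php
<?php
// Hypothetical callback: wraps one matched word in the span tag.
function highlight_word($matches) {
    return sprintf(
            '<span class="hilite">%s</span>',
            htmlspecialchars($matches[1])
        );
}

// Hypothetical helper: highlights each word of the query separately.
function highlight_terms($text, $query) {
    // Split the raw query on runs of whitespace, dropping empty pieces.
    $words = preg_split('/\s+/', $query, -1, PREG_SPLIT_NO_EMPTY);
    if (count($words) === 0) {
        return $text;
    }

    // Escape each word on its own, then join them as alternatives.
    $quoted = array();
    foreach ($words as $word) {
        $quoted[] = preg_quote($word, '/');
    }
    $pattern = '/\b(' . implode('|', $quoted) . ')\b/i';

    return preg_replace_callback($pattern, 'highlight_word', $text);
}

$result = highlight_terms(
        'The quick brown fox jumps over the lazy dog',
        'lazy  fox'
    );
print $result . "\n";
```

Both “fox” and “lazy” get wrapped, wherever they appear in the text. Punctuation in the query would still need more thought, since \b only helps next to word characters.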

Alright – that’s a wrap for this installment. More some other time (and I need a break from regexes for a while).

¹ And it really means any length – including nothing – .jpg would be a valid filename using this regex – probably smarter to use a + quantifier and require at least one character, but I need an excuse to illustrate *.

² In fact I don’t need the sub pattern for this example, because \b is an assertion – it doesn’t actually become part of the match, but this is a good opportunity to illustrate subpatterns, which are frequently used in search and replace operations.