The Joy of Regular Expressions [2]

So continuing the fun started here

Contents

Part 2

Part 3 is here

Where we’ve been so far…

First a quick summary of what we covered in part one;

  • Expression delimiters e.g. /yes/ or %yes%
  • Pattern modifier syntax e.g. /yes/i
  • Meta characters…
    • Start and end assertions: ^ and $ e.g. /^yes$/
    • Length Quantifiers which apply to the preceding character in the pattern:
      • The “one or more” quantifier: +
      • The “min / max” quantifying curly brackets: e.g. {5,20}
    • Introduced character classes e.g. [a-zA-Z0-9]

You also encountered the preg_match() and preg_match_all() functions.

Time for some more syntax, by way of example…

Hunting for .jp(e)gs

Some applications save JPEGs with a file extension .jpeg while everyone else uses .jpg. Now if I’ve got a directory which I know contains some JPEGs, which could be named using either file extension, how do I identify them? And how do I filter out all the other file types in the directory at the same time?


<?php
$dh = opendir('/home/harryf/gallery');

while ( ($file = readdir($dh)) !== FALSE ) {

    if ( preg_match('/^.*\.jpe?g$/i', $file ) ) {
        print "$file\n";
    }
    
}

closedir($dh);

Zooming in on that pattern – /^.*\.jpe?g$/i – what have I got? OK, the ^ and $ meta characters you’ve seen before, and you know they match the start and end of the line. You also know the /i pattern modifier means “case insensitive” – filenames could be upper or lower case. What else do I have here?

The ? is another meta-character: another length quantifier, similar to + and the curly brackets you’ve already seen. It means “zero or exactly one of the preceding character”. So this part of the example: jpe?g means;

I’m looking for a sequence of characters, starting with the letter ‘j’, then ‘p’, then optionally the letter ‘e’ and finally the letter ‘g’

But that’s not the only length quantifier I’ve introduced in this pattern. At the start I also have the * quantifier:

/^.*\.jpe?g$/i

The * quantifier means “zero or more of the preceding character” – no maximum limit, no minimum limit.

OK – but what’s the * quantifier being applied to – well the preceding character in the pattern is a period: . which is also a meta-character but more like the character classes you saw in part 1. It means “any character” – it will match anything (there is an exception to that which I’ll come to later). So, combined with the “zero or more” quantifier * the start of the pattern is saying…

I don’t care what the beginning of the filename is – anything is allowed of any length (see footnote 1) – I’m only interested in the file extension

Which leaves me only needing to explain what the \. in the middle of /^.*\.jpe?g$/i means…

Escaping Meta-Characters

Well, it’s referring to the literal period that separates the filename from its extension, e.g. “mypicture.jpg”. The period is normally a meta-character in regular expressions, but here I need it to match the filename separator, so I have to place a backslash in front of it to escape it: \. Placing a backslash in a pattern tells the regex engine not to regard the following character as a meta-character.
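To see what the escape buys you, here is a minimal sketch that runs the pattern with and without the backslash over a handful of filenames. The filenames are made up for illustration; nothing here comes from a real directory.

```php
<?php
// Hypothetical filenames, invented for this illustration.
$files = array('holiday.jpg', 'PARTY.JPEG', 'notes.txt', 'notajpg', '.jpg');

$escaped   = '/^.*\.jpe?g$/i'; // \. matches only a literal period
$unescaped = '/^.*.jpe?g$/i';  // . matches any single character

$with_escape    = array();
$without_escape = array();

foreach ($files as $file) {
    if (preg_match($escaped, $file)) {
        $with_escape[] = $file;
    }
    if (preg_match($unescaped, $file)) {
        $without_escape[] = $file;
    }
}

// The escaped pattern rejects 'notajpg'; the unescaped one lets it through,
// because the bare . happily matches the letter 'a'.
print implode(', ', $with_escape) . "\n";
print implode(', ', $without_escape) . "\n";
```

Note that both versions accept the bare “.jpg”, because * allows zero characters before the period.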

There’s also the preg_quote() function, intended for escaping stuff like user input, to be embedded in a pattern – more on that in a moment.

There’s a little more detail regarding which characters need escaping. One example: inside a regex character class it’s not necessary (although it doesn’t hurt if you do) to escape every meta-character: some meta-characters, like ‘+’ and ‘*’, automatically assume their literal meaning. Meanwhile other characters need to be escaped in addition to the normal set of meta-characters, if they are intended to have literal meaning, such as ‘-’ which would normally specify a range in a character class. As you start to memorize the syntax, it will become obvious when and where you need to escape characters – don’t worry too much right now.

You should be aware though that when it comes to excessive escaping, life can get fun, because PHP’s strings also use backslashes for escaping certain characters e.g.;


print 'Tuesday\'s Child'; # Just a normal string

And more fun if you use double quotes. In an ideal world we’d have literal regular expressions as a PHP feature, like Perl and Javascript. But anyway… most of the time this won’t bother you; only when it does, it may drive you mad.
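A rough illustration of the two layers of escaping, PHP’s string parser first and the regex engine second. The sample path in $subject is invented for the example.

```php
<?php
// Single quotes: PHP leaves \. alone, so the regex engine sees \.
$single = '/\.jpe?g/i';

// Double quotes: \\ collapses to a single backslash before the regex
// engine ever sees the pattern, so this is the same pattern as above.
$double = "/\\.jpe?g/i";

$same_pattern = ($single === $double);

// Matching one literal backslash takes four in PHP source: PHP turns
// '\\\\' into \\ and the regex engine then reads \\ as one backslash.
$subject = 'C:\photos\cat.jpg'; // hypothetical path
$found_backslash = preg_match('/\\\\/', $subject);

var_dump($same_pattern, $found_backslash);
```

Sticking to single quotes for patterns, where possible, keeps the number of backslashes you have to count to a minimum.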

Search and Replace

So far we’ve only been matching. What about some replacing?

A fairly popular feature to add to a site, although a little “non-vogue” since AJAX, is a “highlighter” for visitors that were referred to your site by a search engine. You identify the search term they used by looking at the HTTP referrer and highlight the corresponding words in your HTML, using something like a span tag.

In fact doing this in PHP is probably not the smartest idea – far better to use Javascript and save some server CPU cycles, but it does make a good example to illustrate regex search and replace, plus it highlights some potential security gotchas.

So kicking off, a naive implementation. I won’t attempt to reproduce the HTTP referrer but rather keep it simple, using a URL query which will be placed in the variable $_GET['q']

Important Note: – this example is not secure (intentionally) – take those fingers off CTRL+C!


<?php
$text = 'The quick brown fox jumps over the lazy dog';

# Do we have a search term?
if ( isset($_GET['q']) ) {
    # Escape the input - make sure it won't contain
    # any regex meta-characters
    $q = preg_quote($_GET['q'], '/');
    
    # Replace any instances of the search term with the
    # same but nested in a span tag...
    $text = preg_replace(
            "/\b($q)\b/i",                    # Pattern
            '<span class="hilite">$1</span>', # Replacement
            $text                             # Subject
        );
    
}
?>
<html><head><title>Hilite</title>
<style type="text/css">.hilite { background-color: yellow }</style>
</head>
<body>
<?php print $text; ?>
</body>
</html>

OK – let me explain first what the code is doing then go on to explain why it’s not safe. Zooming in on the interesting part…



    # Escape the input - make sure it won't contain
    # any regex meta-characters
    $q = preg_quote($_GET['q'], '/');
    
    # Replace any instances of the search term with the
    # same but nested in a span tag...
    $text = preg_replace(
            "/\b($q)\b/i",                    # Pattern
            '<span class="hilite">$1</span>', # Replacement
            $text                             # Subject
        );


preg_quote()

The first thing I’m doing here is quoting the incoming query parameter so that if it contains anything that looks like a regex meta character, or any other regex syntax, it will be escaped by a backslash (if you insert a print statement to examine $q, you’ll be able to figure out what’s happening).

Now preg_quote() puts a backslash in front of any of the following characters…

. \ + * ? [ ^ ] $ ( ) { } = ! < > | :

That basically nails anything that could be mistaken for regex syntax… except for the expression delimiter. Which is what the second argument to preg_quote() is doing here…

$q = preg_quote($_GET['q'], '/');

The second argument tells preg_quote() which expression delimiter you are using, and so escapes it as well.
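As a quick sketch of what that looks like in practice, here is preg_quote() applied to a sample string. The input is invented for illustration and deliberately mixes meta-characters with the / delimiter.

```php
<?php
// Hypothetical user input containing meta-characters and the delimiter.
$input = 'What? 1+1=2 and/or $5.00';

// The second argument asks for the / delimiter to be escaped as well.
$quoted = preg_quote($input, '/');

print $quoted . "\n";

// Once quoted, the input can be embedded in a pattern and will only
// ever match itself, character for character.
$matched = preg_match("/$quoted/", $input);
```

Every ?, +, =, /, $ and . in the input comes back with a backslash in front of it, while letters, digits and spaces are left alone.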

A dose of fear and loathing: if you fail to escape user input and then embed it in a regex, you’ve opened the door to command injection – your users will be able to tell your regex engine what to do. At best this will just result in error messages (which you’re hopefully keeping quiet about) while the worst case scenarios could get very ugly, depending on what you’re doing – don’t forget.

preg_replace()

So what’s the next part of this script doing?


    $text = preg_replace(
            "/\b($q)\b/i",
            '<span class="hilite">$1</span>',
            $text
        );

It’s using the preg_replace() function to wrap all matches of the input search term with a span tag. You’re probably happy with str_replace() right? Well preg_replace() is essentially the same thing, but instead of just plain string substitution, it’s packed with regex goodness.

Now the pattern needs some explaining…

"/\b($q)\b/i"

The /i pattern modifier at the end you recognise, meaning “case insensitive” – this allows me to highlight more “hits” for the incoming search term.

Word Boundaries, Word Characters… and everything else

What about the \b that appears twice? It’s a meta-character meaning “assert a word boundary”. It’s something like the ^ and $ meta characters you’ve seen before but while they assert the start and end of a line, the \b meta-character asserts the “edge of a word” e.g. the point where there’s white space, punctuation etc. next to a sequence of word characters. Here’s how the PHP manual defines a word boundary…

A word boundary is a position in the subject string where the current character and the previous character do not both match \w or \W (i.e. one matches \w and the other matches \W), or the start or end of the string if the first or last character matches \w, respectively.

…Alles klar? The manual is defining word boundaries in terms of two other meta characters, which we haven’t looked at yet: \w – “word character” and \W. Don’t panic – there’s nothing really new here. Both of these are effectively shorthand for the regex character classes you’ve seen before, that save you having to define your own. Here’s the manual definition for \w;

A “word” character is any letter or digit or the underscore character

…and by extension, \W is everything else – everything that’s not a word character (such as punctuation, linefeeds and space characters).

Detail overload!: now \w is actually not necessarily the same as the character class [a-zA-Z0-9_] – the part of the manual definition I omitted;

The definition of letters and digits is controlled by PCRE’s character tables, and may vary if locale-specific matching is taking place. For example, in the “fr” (French) locale, some character codes greater than 128 are used for accented letters, and these are matched by \w.

…in other words \w might let those cheeky foreigners sneak non-ASCII characters past your validation patterns! There’s a long story here that I’ll skip but you might get a further hint of insight here – see the note on locales and be aware this applies to \b by extension.

So where were we? Trying to figure out what a \b “word boundary” means: basically the start or end of any sequence of letters, numbers or underscores. In other words it’s a useful tool to help spot words. So let’s look at the complete pattern again…
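A small sketch of the difference \b makes, using an invented sentence in which “cat” appears both as a word and buried inside longer words, plus a quick look at \w on its own.

```php
<?php
// Hypothetical sample text: 'cat' as a word, and inside two longer words.
$text = 'The cat sat; concatenate the category.';

// Without boundaries, every occurrence of the three letters counts.
$loose = preg_match_all('/cat/', $text, $loose_matches);

// With \b on both sides, only the standalone word counts.
$strict = preg_match_all('/\bcat\b/', $text, $strict_matches);

// \w covers letters, digits and the underscore, but not spaces.
$one_word  = preg_match('/^\w+$/', 'snake_case9');
$two_words = preg_match('/^\w+$/', 'two words');

print "$loose $strict $one_word $two_words\n";
```

The semicolon after “sat” and the space after “cat” are both \W characters, which is why the boundary assertion succeeds there and fails inside “concatenate” and “category”.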

Sub patterns

"/\b($q)\b/i"

Given the two word boundaries, we seem to be matching words here then. But what’s this in the middle: ($q) ? Well the $q is just a normal PHP variable – the user’s search term – which becomes part of the pattern thanks to the double quotes. So what are the parentheses doing around it? They are – you guessed it – also regex meta-characters: the delimiters for either side of a sub pattern (see footnote 2).

Sub patterns are a way of grouping parts of a pattern. They have a number of uses (more of which I’ll explore some other time) but for the purposes of our search and replace operation, they’re a way to mark part of the pattern as “especially interesting” – we don’t want that white space either side of the search term the user provided – we just want to capture the word itself, and replace it. To see how this works, you need to look at the next argument to preg_replace();


    $text = preg_replace(
            "/\b($q)\b/i",
            '<span class="hilite">$1</span>',
            $text
        );

…the '<span class="hilite">$1</span>'. This is what we want to replace the search term with. You can see the span tag there, waiting expectantly, but how does the matched word get embedded in there? Via the $1 – a backreference to the sub pattern. You’ll have to read the documentation for a full description of replacement backreference syntax but, in short, they’re “pointers” to stuff you matched with your pattern. The backreference $0 will always be available, and corresponds to the complete match. Further backreferences like $1 will exist if you used sub patterns (the more sub patterns, the more numbered backreferences will be available).

So summing all this up, if I search for the word “fox”, preg_replace() will find it in $text here: brown fox jumps and replace it with brown <span class="hilite">fox</span> jumps.
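To make the difference between $0 and $1 concrete, here is a minimal sketch using the same sentence. The first pattern, with the sub pattern wrapped around only part of the word, is contrived purely to show that the two backreferences can differ.

```php
<?php
$text = 'The quick brown fox jumps over the lazy dog';

// $0 is the complete match ('quick'); $1 is only what the
// sub pattern captured ('qu').
$demo = preg_replace('/\b(qu)ick\b/', '[$0|$1]', $text);

// The highlighter from the article, with the search term hard-coded.
$hilited = preg_replace(
        '/\b(fox)\b/i',
        '<span class="hilite">$1</span>',
        $text
    );

print $demo . "\n";
print $hilited . "\n";
```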

Spot the XSS Hole

There’s a small problem though, which you may have already spotted. While I’ve wisely used preg_quote() to escape any input that looks like regex syntax, what if the search term contains HTML / Javascript? Right now it’s potentially open to cross site scripting. Rather than just replace, I need to run the sub pattern match through htmlspecialchars() first. But how?

A little scanning of the pattern modifiers and you’ll find the /e modifier;

If this modifier is set, preg_replace() does normal substitution of backreferences in the replacement string, evaluates it as PHP code, and uses the result for replacing the search string. Single and double quotes are escaped by backslashes in substituted backreferences.

Alright – problem solved…

Warning: this is also potentially insecure and with more serious consequences!


    $text = preg_replace(
            "/\b($q)\b/ie",
            '"<span class=\"hilite\">".htmlspecialchars("$1")."</span>"',
            $text
        );

eval() is evil!

With the /e modifier attached to the pattern, the replacement string stops being a string and becomes PHP code, ripe for eval(). The moment you hear eval() the word “evil” should be on the tip of your tongue. Here’s a tale of what can happen.

The simple rule is don’t use the /e modifier. Aside from the security implications, the replacement code becomes something that has to be parsed, interpreted and executed on every substitution (if you want real fear and loathing, say “poor performance”). And hey – writing code for PHP’s eval is a mind warping exercise in quotes.

So what’s the alternative?

preg_replace_callback()

Instead of using preg_replace(), use preg_replace_callback(), avoiding evals and offering significantly better performance. Rather than providing it a string of code, you give it the name of a function to execute. Each time it needs to make a replacement, the function will be called with the match. So here’s the final solution;


    function highlight_search($matches) {
        return sprintf(
                '<span class="hilite">%s</span>',
                htmlspecialchars($matches[1])
            );
    }
    
    $text = preg_replace_callback(
            "/\b($q)\b/i",
            'highlight_search',
            $text
        );

The second argument to preg_replace_callback() is the name of my highlight_search() callback function, which will get called every time there’s a match to be replaced.

The callback function needs to handle a single parameter: $matches, which will always be an indexed array. And in much the same way as the replacement backreferences preg_replace() provides, the first element in the array will always exist and correspond to the complete match – that is $matches[0]. Sub patterns are assigned higher indexes in the array – in this case my sub pattern is at $matches[1] so all I need to do is run it through htmlspecialchars(), embed it inside the span tag with help from sprintf(), return it and I’m done. The returned value gets used as the replacement.

Be aware that the approach I’ve used of simply placing the incoming search term directly in the pattern (after escaping) is probably less than ideal from a usability perspective. Some thought needs to be given to stuff like punctuation in the search term, and it might be smarter to break the search term up into pieces and analyse it a little first. So this example is a sub-optimal solution, but I think it manages to illustrate regex search and replace reasonably well.
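One possible way to “break the search term up into pieces” is sketched below. This is only an illustration under some assumptions: the function names highlight_word() and highlight_terms() are made up, the query is split on whitespace and nothing smarter, and the pattern uses the | alternation meta-character (“this word or that word”), which hasn’t been covered in this series yet.

```php
<?php
// Callback: wrap one matched word in the span tag (hypothetical name).
function highlight_word($matches) {
    return sprintf(
            '<span class="hilite">%s</span>',
            htmlspecialchars($matches[1])
        );
}

// Highlight every word of $query found in $text (hypothetical name).
function highlight_terms($text, $query) {
    // Split on runs of whitespace, discarding empty pieces
    $words = preg_split('/\s+/', $query, -1, PREG_SPLIT_NO_EMPTY);
    if (count($words) === 0) {
        return $text; // Nothing to search for
    }

    // Escape each word separately before it goes into the pattern
    $quoted = array();
    foreach ($words as $word) {
        $quoted[] = preg_quote($word, '/');
    }

    // Builds something like /\b(fox|quick)\b/i
    $pattern = '/\b(' . implode('|', $quoted) . ')\b/i';

    return preg_replace_callback($pattern, 'highlight_word', $text);
}

print highlight_terms('The quick brown fox', 'fox  quick') . "\n";
```

Each word is quoted on its own, so the | characters that join them are the only regex syntax the pattern gains from the query.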

Alright – that’s a wrap for this installment. More some other time (and I need a break from regexes for a while).

1 And it really means any length – including nothing – .jpg would be a valid filename using this regex – probably smarter to use a + quantifier and require at least one character, but I need an excuse to illustrate *.

2 In fact I don’t need the sub pattern for this example, because \b is an assertion – it doesn’t actually become part of the match, but this is a good opportunity to illustrate subpatterns, which are frequently used in search and replace operations.

  • binjured

    “The moment you hear eval() the word “evil” should be on the tip of your tongue”

    I believe more accurately, the moment you hear eval() the words “poor programmers” should be on the tip of your tongue. While I believe in this case, and really any case where smarter alternative exist, the use of eval() should be avoided (more for performance reasons than security reasons imho), it’s become tiresome to constantly hear about the “evils of eval”. Using that logic, mysql_query() is nearly as evil as eval() because it allows malicious code to be inserted into a script.

    In general, are there better choices than eval() for a given situation? Yes. Is eval() often a lazy way out? Yes. Does it have poor performance? Yes. But to call it evil because it’s “insecure” is inaccurate. In the wrong person’s hands most anything can be considered “evil” because using unsanitized input opens the door wide open for any number of attacks. Avoiding the use of eval() is generally good practice on a number of levels, but a competent developer should have no fears of using it because they should already be assured of the security of their user input.

  • http://www.phppatterns.com HarryF

    But to call it evil because it’s “insecure” is inaccurate.

    Agreed – it’s just another tool. The “eval() is evil” mantra is much like the “you must normalize your dbs to the nth form” – for starters it saves the need for long discussion and it’s also encouraging certain practices – what’s less often discussed is when you’d want to denormalize your db schemas, often for efficiency, but “denormalize your tables!” or “eval is OK sometimes” is not something you hear.

    Think the particular danger here is the /e modifier helps you write a neat one-liner while preg_replace_callback is less “elegant” (relative term) – that might catch out people who’d know better otherwise.

  • http://www.phpism.net Maarten Manders

    Is there an easy way to explicitly match linebreaks, no matter whether they’re unix/windows/mac style?

  • http://www.phppatterns.com HarryF

    Is there an easy way to explicitly match linebreaks, no matter whether they’re unix/windows/mac style?

    Don’t have a generic solution off-hand so if you’ve got one, feel free to blow my mind ;)

    Somehow think it’s probably best to normalize to \n before starting to use regexes, given this is what the . metacharacter “understands” (controlled by the /s pattern modifier).

    But if you wanted to split a string into lines, without caring what kind of linebreaks you’re dealing with, this should work;


    $lines = preg_split('/\r\n?|\n/', $text);

    To normalize up front, using strtr should be pretty efficient;


    $text = strtr($text, array( "\r\n"=>"\n" , "\r"=>"\n" ));

  • http://www.phpism.net Maarten Manders

    Harry, didn’t you mean str_replace? For normalization we’re using

      
    str_replace(array("\r\n", "\r"), array("\n", "\n"), $input)
    
  • http://www.phpism.net Maarten Manders

    The replacement doesn’t need to be an array, just "\n", if I remember correctly.

  • http://www.phppatterns.com HarryF

    Harry, didn’t you mean str_replace?

    Well I meant strtr() (the example works) based on the unconfirmed theory that it would offer better performance than str_replace(). But looking at this (see str_replace vs. strtr later on) reminds me it’s best always to question “accepted performance wisdom” with benchmarks.

    What’s interesting is str_replace performs better in this case, but only significantly so if the “replace” target is a string – not an array

    Fastest:

    $str = str_replace(array( "\r\n","\r"), "\n", $str);

    About the same:

    $str = str_replace(array( "\r\n","\r"), array("\n","\n"), $str);

    $str = strtr($str, array( "\r\n"=>"\n" , "\r"=>"\n" ));

    Even more excitement – doing two separate strtr’s seems to match the performance of the fastest str_replace above;


    $str = strtr($str, "\r\n", "\n");
    $str = strtr($str, "\r", "\n");

    …but used pretty short $str’s so that may not scale to bigger documents.

    Side note: additional confusion is being caused in these examples by the comment feature adding slashes to quotes :(

  • http://www.procata.com/ Selkirk

    Nice article. There is also another problem with escaping the replacement parameter of preg_replace under certain conditions. preg_quote is not suitable for that task because it is geared toward the search parameter and not the replacement parameter. I talk about that and the e modifier in my blog post on preg_replace escaping.

  • http://en.journey.bg/portal.html 1magic

    Nice article series. Thanks!

  • Sean

    Great article, and thanks for the bit on eval() – I never knew of its security flaws over and above the obvious injection attacks.

    By the way, the link to sprintf() on php.net is incorrect here – it links to http://php.nett/sprintf.

  • http://www.phppatterns.com HarryF

    Great article, and thanks for the bit on eval()—I never knew of its security flaws over and above the obvious injection attacks.

    By the way, the link to sprintf() on php.net is incorrect here—it links to http://php.nett/sprintf.

    Thanks and thanks – now fixed at last.

  • Päse

    Nice examples! But you should check your “Hunting for .jp(e)g” because if the filename has dots as separators in it the regexp won’t work ;-)

  • HarryF

    if the filename has dots as separators in it the regexp won’t work

    It would still work: /^.*\.jpe?g$/i – all I’m doing there is requiring that the filename ends with .jpg or .jpeg (case insensitive). Otherwise you’re allowed anything you like, including . characters.