The Joy of Regular Expressions [1]

Was asked recently if I knew of any good regular expressions tutorials (preferably in PHP). The question came from someone certainly smart enough to “get” regular expressions but they’d been unable to find accessible help.

Most regular expression tutorials I’ve seen are organised around teaching the syntax incrementally, which can quickly lead to mental overload. Examples commonly revolve around strings like ‘aaabbababa’ – great if you’re writing a web crawler for Swedish pop, but confusing for anyone else. And while there are copy-and-paste regular expressions online, if you don’t know what you’re doing, using them can be worse than using none at all. Do they meet your needs? Whoops! Mind the security hole…

So going to take a crack at Yet Another Regular Expressions tutorial, with a focus on doing (in PHP) while slowly introducing you to regexp (shorthand for regular expression) syntax. This is going to span a few blog posts (will keep the contents below updated) and get progressively “more interesting” – not all for beginners but if you keep up, hopefully you’ll be able to grasp it. And although it’s “Regexes and PHP”, the regex syntax I’ll be using is largely portable to other programming languages.

Part 1

Resources
Background
Why do I need regular expressions?
Grasping the Concept
Learning by Doing
Positive Matching
Expression Delimiters and Pattern Modifiers
- preg_match() return values
- preg_match_all()
Exact Match
- A Fairy Tale
- Tactics
Validating a username
- Quantifying Length

Part 2 is here

Resources

Before I dive in, a choice link or two. One of the best PHP-specific regular expressions tutorials is here – it takes the “incremental syntax” approach but manages to be reasonably friendly. It’s also pretty comprehensive. You won’t find anything new here regex-syntax-wise: just a retelling of the story and perhaps some interesting examples.

Wikipedia has of course a ton of information and links to yet more information, but probably more than you’ll want to digest in one go.

If you’re really feeling brave, you might also try the Perl regular expression tutorial. Although Perl’s API for executing regular expressions is significantly different to PHP’s, the regular expression syntax itself is almost exactly the same and the tutorial is rich with further insight.

Finally, The Regex Coach (thanks Maarten for tip off which I’d written off without trying) is an excellent tool, not just for learning but also debugging regular expressions and getting a feel for performance (i.e. pros may find it helpful also).

Some background

PHP comes with two sets of regular expression functions and syntax – the POSIX extended regular expressions and the Perl Compatible Regular Expressions extension.

Once upon a time, the underlying code for these extensions was different, but these days both use the same PCRE engine – this gets bundled into the PHP distributions you download. The discussion here will focus purely on the Perl Compatible syntax – it’s more powerful and has become more-or-less a standard – once you know it, you’ll find it largely supported by almost all popular programming languages, from Java to Javascript.

And note that PHP isn’t the only project using the PCRE library. While some languages have built their own implementation from scratch, you’ll find PCRE is also used in Apache and numerous other Open Source projects that need powerful regex support at minimum effort.

Why do I need regular expressions?

…because they’re pretty much essential for anything but the most trivial text processing. By “text processing” I mean anything where you’re analysing or modifying a string of characters e.g. replacing characters like < and > with &lt; and &gt;, splitting a string containing semi-colons into a list of smaller strings, counting the number of times a particular word occurs etc. For these types of simple problem, you may well be able to survive with basic string functionality, but the trickier the task, the harder it gets to work without regular expressions. Consider validating that a user-submitted URL obeys the RFC 2396 syntax, for example – with basic string functions alone, very hard. With regular expressions it’s do-able.
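For the simple cases just mentioned, PHP’s ordinary string functions really are enough. A minimal sketch, using sample strings I’ve made up purely for illustration:

```php
<?php

// Replace < and > (and other special characters) with HTML entities.
$escaped = htmlspecialchars('<b>bold</b>');
// $escaped is now '&lt;b&gt;bold&lt;/b&gt;'

// Split a semi-colon separated string into a list of smaller strings.
$parts = explode(';', 'red;green;blue');
// $parts is now ['red', 'green', 'blue']

// Count how many times a particular word occurs.
$count = substr_count('yes, yes and yes again', 'yes');
// $count is now 3
```

None of these needs a pattern; it’s when the rules get fuzzier (“a word”, “a valid URL”) that regexes start to earn their keep.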

Convinced? Probably not. So how about some fear and loathing for “why regular expressions?”: without regular expressions, you can’t write a secure web application. Although PHP provides other tools which can be used in simple testing of input, pretty soon you’ll have a problem for which only regular expressions make sense.

Otherwise – believe it or not – they make your life easier. If you consider writing a BBCode parser or the task of extracting all links from an HTML document, for example – regular expressions can make it a breeze (examples another time).

Grasping the concept

Perhaps the tallest hurdle with regexes is conceptual – just what are they?

One nerdy answer is they’re a domain specific language – a “mini” programming language designed specifically for describing and matching text. Perhaps not such a useful description for beginners…

Another way to think of regexes is by analogy. Most people who’ve put together some basic database driven web application are familiar with SQL, as a language for retrieving (SELECTing) data from your RDBMS (e.g. MySQL). Regular expressions can be thought of as the same thing as SQL but instead of pulling data out of your database, you use them to pull data out of a block of text. And much like you embed SQL statements into your code (unless you’re doing some kind of ORM), you do the same with regular expressions – where you might call mysqli_query() to execute your SQL statement, you call functions like preg_match() to execute your regular expression.

Of course you can go too far with analogies, so I’ll stop there. The main point is regular expressions are instructions for your regex engine, telling it how to go about finding the characters you want from a given block of text.

Learning by doing…

Like any language, the best way to learn regular expressions is by practice and patience. The point where you start to become confident is when you’ve memorised most of the syntax, and are able to read regular expressions without having to consult the documentation.

To that end, will begin exploring the syntax using web-relevant examples (that you could perhaps re-use). There will be other approaches (including solutions that avoid regexes) but the purpose is illustrating regexes, so bear with me.

Positive Matching

The easiest place to start is with some regular expressions that literally match the text you’re looking for, without any additional regex syntax.

So an example that’s a little contrived but anyway… You have a form asking a user whether they’ve read the “Terms and Conditions of Sign-up”, and you have their answer stored in the variable $answer. You now want to test whether they answered “yes” to the question – anything else is regarded as a “no”. Using the preg_match() function you could do it like this…


if ( preg_match('/yes/', $answer) ) {
    
    print "Say YES!!!\n";
    
} else {
    
    print "what do you mean no?!?\n";
    
}

Now allow me to overwhelm you with some details. What this code is asking is “can I find the string ‘yes’ anywhere inside the string $answer?”.

The regular expression is the first argument to preg_match() – the '/yes/'. In PHP, regular expressions are always placed inside PHP strings (just like SQL). This is unlike some other languages, such as Javascript and Perl, where regular expressions can also be “literals” e.g. (Javascript);


if ( /yes/.exec(answer) ) {
    alert("Say YES!!!");
}

In PHP this means you need to be a little careful when it comes to backslashes as well as being aware of how PHP parses strings.
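To make that a little more concrete, here’s a rough sketch of how PHP’s own string parsing gets in before the regex engine does. The variable names and sample strings are just mine for illustration:

```php
<?php

// Single quotes: PHP leaves \d alone, so the regex engine sees \d (a digit).
$digits = '/\d+/';

// Double quotes: PHP expands $variables and escapes such as \$ itself,
// before the regex engine ever sees the pattern.
$word = 'yes';
$interpolated = "/^$word\$/";   // the string PHP builds is /^yes$/

// Matching one literal backslash: the regex needs \\ and, inside a PHP
// single-quoted string, each of those backslashes must itself be doubled.
$backslash = '/\\\\/';

$hasDigits    = preg_match($digits, 'room 101');
$isYes        = preg_match($interpolated, 'yes');
$hasBackslash = preg_match($backslash, 'C:\temp');
```

All three calls return 1. Sticking to single-quoted strings for patterns is generally the less surprising choice, since PHP then touches very little of what you typed.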

Expression Delimiters and Pattern Modifiers

So what are the two forward slashes doing here?


if ( preg_match('/yes/', $answer) ) {

They are the expression delimiters marking the start and end of the regular expression. In this example it’s not clear why you need them, but the purpose is to allow inclusion of pattern modifiers at the end of the expression. Pattern modifiers are “global instructions” to the regex engine, telling it to alter its default behaviour. I’ll look at pattern modifiers more soon but one example is the /i modifier, which tells the engine to perform case insensitive matching e.g.


if ( preg_match('/yes/i', $answer) ) {
    // etc.

By placing the /i pattern modifier at the end of the expression, I can now match both the strings ‘yes’ and ‘YES’ (and ‘YeS’ or other combinations of upper and lower case).

Note that the expression delimiter doesn’t have to be a forward slash – you can also use pretty much anything apart from a backslash, whitespace or an alpha-numeric character. You just need to make sure you use the same delimiter at each end of the pattern. For example;


if ( preg_match('%yes%i', $answer) ) {
    // etc.

I’ve used the ‘%’ character instead of a forward slash to delimit the expression. This can be useful when the pattern you want to search for contains the delimiter (common if you’re matching something like a URL or a file system path) – just change the delimiter, rather than having to escape characters within the expression (more on escaping another time).
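A quick sketch of the difference a delimiter makes, assuming a made-up URL in $url. Both patterns ask the same question: does the string start with ‘http://example.com/’?

```php
<?php

// Hypothetical input, for illustration only.
$url = 'http://example.com/blog/';

// With / as the delimiter, every / inside the pattern needs a backslash.
$escaped = preg_match('/^http:\/\/example\.com\//', $url);

// With % as the delimiter, the slashes can be written as they are.
$readable = preg_match('%^http://example\.com/%', $url);
```

Both return 1 here; the second is simply easier on the eye. (The backslash before each dot is escaping of a different kind, which I’m glossing over for now.)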

preg_match() return value

According to the PHP manual, preg_match() returns the number of times it was able to match the pattern you gave it (the first argument ‘/yes/’) against the string you are searching (the second argument $answer). So if it was unable to make any matches, it returns an integer 0, which will fail a PHP if condition. preg_match() also stops searching the moment it makes a first successful match, so it will only ever return 1 at most. Now you might be wondering, if the result is either 0 or 1, why doesn’t the manual just say 0 or 1? The point it’s trying to convey is that preg_match() stops as soon as it finds a match – that can be important when you’re running regexes across large documents, where performance may be significant: if you want to check a document contains a word, and the word happens to be in the first paragraph, you don’t want the regex engine scanning the entire document when it’s already found a match.

Note that 0 and 1 aren’t the only returned values – if something goes wrong (like the pattern is not valid regex syntax), it will return FALSE (plus you’ll get a rude error warning) – make sure you check carefully if you are generating patterns on the fly.
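Because 0 and FALSE both fail a plain if test, telling them apart takes a strict comparison. A minimal sketch, with a deliberately broken pattern and a $status variable that exists only for this illustration:

```php
<?php

// A deliberately broken pattern: the closing delimiter is missing.
$pattern = '/yes';

// The @ operator silences the warning so the failure can be handled here.
$result = @preg_match($pattern, 'yes');

if ($result === false) {
    // The pattern itself could not be used: a bug, not a "no match".
    $status = 'invalid pattern';
} elseif ($result === 0) {
    $status = 'no match';
} else {
    $status = 'match';
}
```

Here $status ends up as ‘invalid pattern’. Whether you silence the warning or log it is a matter of taste; the important bit is the === false check.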

More on getting the actual matches out of preg_match() another time.

preg_match_all()

By contrast, preg_match_all() keeps on going until it’s examined the entire text you are searching. This can be illustrated with the following;


<?php

$answer1 = "no";
$answer2 = "yes";
$answer3 = "yes yes";

print preg_match('/yes/', $answer1)."\n";           // Displays '0'
print preg_match('/yes/', $answer2)."\n";           // Displays '1'
print preg_match('/yes/', $answer3)."\n";           // Displays '1'

print preg_match_all('/yes/', $answer1, $m)."\n";   // Displays '0'
print preg_match_all('/yes/', $answer2, $m)."\n";   // Displays '1'
print preg_match_all('/yes/', $answer3, $m)."\n";   // Displays '2'

More on preg_match() and preg_match_all() another time (such as how to get the matched text out of them).

Exact Match

Now so far, I’ve only been able to confirm that $answer contains ‘yes’ somewhere inside it. That means if the user provides an answer like ‘Bayesian spam filter’, it will pass my test. I really want to be 100% sure that the user said exactly ‘yes’ to the terms and conditions. So I need a little more pattern syntax, namely two meta-characters…


if ( preg_match('/^yes$/', $answer) ) {
    // etc.

The ^ meta-character means “assert that we match from the start of $answer” and the $ meta-character means “assert that we match to the end of $answer”. So what the expression is now saying is something like;

Match the word ‘yes’ but do not match anything else

Best not get hung up on the philosophical meaning of the term “meta-character” – just remember these two – ^ asserts the start of the string and $ asserts the end – combined they help you make exact matches against a complete string.
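One wrinkle worth knowing about, which goes slightly beyond the simple picture above: in PCRE the $ meta-character will, by default, also match just before a single newline at the very end of the string. If “exactly” has to mean exactly, the D pattern modifier or the \z assertion closes that gap. A small sketch:

```php
<?php

// A trailing newline still passes the ^...$ test...
$loose = preg_match('/^yes$/', "yes\n");       // 1

// ...unless the D modifier ("dollar end only") is added...
$strict = preg_match('/^yes$/D', "yes\n");     // 0

// ...or \z is used, which only ever matches at the very end.
$strictZ = preg_match('/^yes\z/', "yes\n");    // 0
```

For a yes/no form field this hardly matters, but it’s worth keeping in mind once the same trick is used to validate things like usernames.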

A Fairy Tale

You could also use them separately. Another contrived example (this will be the last, I promise): you have a site where users can post fairy tales, and you want to make sure every story begins “Once upon a time”;


if ( !preg_match('/^Once upon a time/', $story) ) {
    die("This is not how a real fairy tale starts!\n");
}

Then to make sure they finish with “happily ever after”, you add…


if ( !preg_match('/happily ever after$/', $story) ) {
    die("Don't give me sob stories!\n");
}

More meta-characters in a moment.

A Note on Tactics

Now some regex masters can build giant expressions as a single pattern by hand. For the rest of us, a smarter approach is to keep expressions small, doing only a single task. Once regexes start to grow, they can become extremely hard to debug when they stop functioning as expected.

As the previous example illustrates, you can get a lot of mileage out of repeated smaller patterns, the downside being potential performance overhead, depending on what you’re doing, and extra lines of code.
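One way to keep a pile of small patterns manageable is to put them in a list and run them in a loop. This is only a sketch: checkFairyTale() is a name I’ve invented for the purpose, and the two patterns are the ones from the fairy tale example.

```php
<?php

// Hypothetical helper (not a built-in): runs each small check in turn and
// collects a message for every one that fails.
function checkFairyTale(string $story): array
{
    $checks = [
        '/^Once upon a time/'   => 'must begin "Once upon a time"',
        '/happily ever after$/' => 'must end "happily ever after"',
    ];

    $problems = [];
    foreach ($checks as $pattern => $message) {
        if (!preg_match($pattern, $story)) {
            $problems[] = $message;
        }
    }
    return $problems;
}

$good = checkFairyTale('Once upon a time they lived happily ever after');
$bad  = checkFairyTale('It was a dark and stormy night');
```

$good comes back as an empty array and $bad holds both messages. Each pattern stays small enough to read at a glance, and adding a new rule is one more line in the list.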

If you do find your regexes growing, you can make them more readable using the /x pattern modifier, which allows you to split a regex across multiple lines and include comments – I’ll be illustrating that another time, as well as approaches that can help you process text with regexes in stages.

Validating a username

Moving on to an example much nearer to home, one classic beginner’s mistake, when adding a user authentication system to a web app, is allowing users to choose just whatever username they please when they register. Pretty soon some smart guy comes along and registers themselves as something like ‘ admin’ (note the initial space character) and proceeds to make confusing posts all over your site, in the worst case exploiting poorly constructed code.

In general it’s a good idea to be very restrictive on key identifiers such as usernames so this is a good opportunity to introduce a special kind of regex meta-character: the character class. In addition to the “built-in” meta-characters, such as the ^ and $ characters you’ve seen, you can also define your own meta-characters by using a character class, which is used to represent a single character. Jumping right to an example…


if ( !preg_match('/^[a-zA-Z0-9_]+$/', $username) ) {
    die("Invalid username: only letters, digits and underscores allowed.");
}

My character class here is [a-zA-Z0-9_] – it matches any character which obeys one of the following conditions;

It’s a lower case character between ‘a’ and ‘z’
… or it’s an upper case character between ‘A’ and ‘Z’
… or it’s a digit between ‘0’ and ‘9’
… or it’s just an ‘_’ underscore character.

The minus sign ‘-’ which appears in the character class specifies a range and you’ll notice that between, say, ‘a’ and ‘z’ in the table of ASCII characters, you have all the lower case letters of the alphabet, nicely sorted.
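To see the character class on its own, without the + that follows it in the pattern, you can test one character at a time. A rough sketch, with a sample string picked just to hit each case:

```php
<?php

// Test each character of a sample string against the class on its own.
$class   = '/^[a-zA-Z0-9_]$/';
$allowed = [];
$refused = [];

foreach (str_split('aZ5_ -!') as $char) {
    if (preg_match($class, $char)) {
        $allowed[] = $char;
    } else {
        $refused[] = $char;
    }
}
// $allowed is ['a', 'Z', '5', '_']; $refused is [' ', '-', '!']
```

Note that the minus sign itself is refused: inside this class it only marks the ranges and isn’t one of the permitted characters. The space is refused too, which is exactly what stops ‘ admin’.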

Quantifying Length

You may also have noticed I sneaked in another meta-character into the last example – the + quantifier.

The + meta-character refers to the preceding character (or meta-character) in the pattern and modifies its meaning to “one or more of this character” – it quantifies its length. So my example…


if ( !preg_match('/^[a-zA-Z0-9_]+$/', $username) ) {
    // etc.

…requires that usernames are at least one character long but places no restriction on the maximum length. Now that’s actually not such a smart idea – usernames probably need to be at least 5 characters long to be readable and, given space limits in a VARCHAR column and screen resolutions, it’s probably wise to impose a maximum length; say 20 characters.

Instead of using the + quantifier, I can use a min/max quantifier that I define myself, using the curly brackets { };


if ( !preg_match('/^[a-zA-Z0-9_]{5,20}$/', $username) ) {
    // etc.

Just like the + quantifier, the min/max quantifiers apply to the preceding character (or meta-character) in the pattern.

And here you’re starting to see some of the power of regular expressions over alternative approaches. My username check now looks at not just the contents of the username (which characters it contains) but also its length, with a single statement.

Just so you know, the min/max quantifiers allow you to do other length checks, depending on whether you omit the min or max e.g.


# Username must be _at least_ 5 characters long but no max limit...
if ( !preg_match('/^[a-zA-Z0-9_]{5,}$/', $username) ) {
    // etc.

…and…


# Username must be _exactly_ 10 characters long...
if ( !preg_match('/^[a-zA-Z0-9_]{10}$/', $username) ) {
    // etc.
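
Pulling the username check together, here’s a sketch of how it might be wrapped up and tried out. isValidUsername() is a name I’ve made up, and the sample usernames are only there to exercise each rule:

```php
<?php

// Hypothetical helper: true when $username is 5 to 20 letters, digits
// or underscores, and nothing else.
function isValidUsername(string $username): bool
{
    return preg_match('/^[a-zA-Z0-9_]{5,20}$/', $username) === 1;
}

$results = [
    'bob'                    => isValidUsername('bob'),                    // too short
    'harry_99'               => isValidUsername('harry_99'),               // fine
    ' admin'                 => isValidUsername(' admin'),                 // space not allowed
    'a_very_long_username_x' => isValidUsername('a_very_long_username_x'), // too long
];
```

Only ‘harry_99’ comes back true. Comparing against 1 with === also means a broken pattern (which returns FALSE) can never be mistaken for a valid username.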

OK – enough for part 1. Nothing too intense so far – more regex action next time…