The Joy of Regular Expressions [1]

Was asked recently if I knew of any good regular expressions tutorials (preferably in PHP). The question came from someone certainly smart enough to “get” regular expressions but they’d been unable to find accessible help.

Most regular expression tutorials I’ve seen are organised around teaching the syntax incrementally, which can quickly lead to mental overload. Examples commonly revolve around strings like ‘aaabbababa’ – great if you’re writing a web crawler for Swedish pop, but confusing for anyone else. And while there are copy-and-paste regular expressions online, if you don’t know what you’re doing, using them can be worse than using none at all. Do they meet your needs? Whoops! Mind the security hole…

So going to take a crack at Yet Another Regular Expressions tutorial, with a focus on doing (in PHP) while slowly introducing you to regexp (shorthand for regular expression) syntax. This is going to span a few blog posts (will keep the contents below updated) and get progressively “more interesting” – not all for beginners but if you keep up, hopefully you’ll be able to grasp it. And although it’s “Regexes and PHP”, the regex syntax I’ll be using is largely portable to other programming languages.

Contents

Part 1

Part 2 is here

Resources

Before I dive in, a choice link or two. One of the best PHP-specific regular expressions tutorials is here – it takes the “incremental syntax” approach but manages to be reasonably friendly. It’s also pretty comprehensive. You won’t find anything new here regex-syntax-wise: just a retelling of the story and perhaps some interesting examples.

Wikipedia has of course a ton of information and links to yet more information, but probably more than you’ll want to digest in one go.

If you’re really feeling brave, you might also try the Perl regular expression tutorial. Although Perl’s API for executing regular expressions is significantly different to PHP’s, the regular expression syntax itself is almost exactly the same and the tutorial is rich with further insight.

Finally, The Regex Coach (thanks to Maarten for the tip-off; I’d written it off without trying it) is an excellent tool, not just for learning but also for debugging regular expressions and getting a feel for performance (i.e. pros may find it helpful too).

Some background

PHP comes with two sets of regular expression functions and syntax – the POSIX extended regular expressions and the Perl Compatible Regular Expressions extension.

Once upon a time, the underlying code for these extensions was different but these days both use the same PCRE engine – this gets bundled into the PHP distributions you download. The discussion here will focus purely on the Perl Compatible syntax – it’s more powerful and has become more-or-less a standard – once you know it, you’ll find it largely supported by almost all popular programming languages, from Java to Javascript.

And note that PHP isn’t the only project using the PCRE library. While some languages have built their own implementation from scratch, you’ll find PCRE is also used in Apache and numerous other Open Source projects that need powerful regex support at minimum effort.

Why do I need regular expressions?

…because they’re pretty much essential for anything but the most trivial text processing. By “text processing” I mean anything where you’re analysing or modifying a string of characters e.g. replacing characters like < and > with &lt; and &gt;, splitting a string containing semi-colons into a list of smaller strings, counting the number of times a particular word occurs etc. For these types of simple problems, you may well be able to survive with basic string functionality but the trickier the task, the harder it gets to work without regular expressions. Consider validating that a user-submitted URL obeys the RFC 2396 syntax, for example – with basic string functions alone, very hard. With regular expressions it’s do-able.

Convinced? Probably not. So how about some fear and loathing for “why regular expressions?”: without regular expressions, you can’t write a secure web application. Although PHP provides other tools which can be used in simple testing of input, pretty soon you’ll have a problem for which only regular expressions make sense.

Otherwise – believe it or not – they make your life easier. If you consider writing a BBCode parser or the task of extracting all links from an HTML document, for example – regular expressions can make it a breeze (examples another time).

Grasping the concept

Perhaps the tallest hurdle with regexes is conceptual – just what are they?

One nerdy answer is they’re a domain specific language – a “mini” programming language designed specifically for describing and matching text. Perhaps not such a useful description for beginners…

Another way to think of regexes is by analogy. Most people who’ve put together some basic database driven web application are familiar with SQL, as a language for retrieving (SELECTing) data from your RDBMS (e.g. MySQL). Regular expressions can be thought of as the same thing as SQL but instead of pulling data out of your database, you use them to pull data out of a block of text. And much like you embed SQL statements into your code (unless you’re doing some kind of ORM), you do the same with regular expressions – where you might call mysqli_query() to execute your SQL statement, you call functions like preg_match() to execute your regular expression.

Of course you can go too far with analogies, so I’ll stop there. The main point is regular expressions are instructions for your regex engine, telling it how to go about finding the characters you want from a given block of text.

Learning by doing…

Like any language, the best way to learn regular expressions is by practice and patience. The point where you start to become confident is when you’ve memorised most of the syntax, and are able to read regular expressions without having to consult the documentation.

To that end, will begin exploring the syntax using web-relevant examples (that you could perhaps re-use). There will be other approaches (including solutions that avoid regexes) but the purpose is illustrating regexes, so bear with me.

Positive Matching

The easiest place to start is with some regular expressions that literally match the text you’re looking for, without any additional regex syntax.

So an example that’s a little contrived but anyway… You have a form asking a user whether they’ve read the “Terms and Conditions of Sign-up”, and you have their answer stored in the variable $answer. You now want to test whether they answered “yes” to the question – anything else is regarded as a “no”. Using the preg_match() function you could do it like this…


if ( preg_match('/yes/', $answer) ) {
    print "Say YES!!!\n";
} else {
    print "what do you mean no?!?\n";
}

Now allow me to overwhelm you with some details. What this code is asking is “can I find the string ‘yes’ anywhere inside the string $answer?”.

The regular expression is the first argument to preg_match() – the '/yes/'. In PHP, regular expressions are always placed inside PHP string variables (just like SQL). This is unlike some other languages, such as Javascript and Perl, where regular expressions can also be “literals” e.g. (Javascript);


if ( /yes/.exec(answer) ) {
    alert("Say YES!!!");
}

In PHP this means you need to be a little careful when it comes to backslashes as well as being aware of how PHP parses strings.
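A quick sketch of what that care looks like in practice (the variable names are mine, purely for illustration). Inside a single-quoted PHP string a double backslash collapses to one, so matching a single literal backslash takes four of them in the source: PHP reduces them to two, and the regex engine reads those two as one escaped backslash.

```php
<?php
// Single quotes: PHP only treats \\ and \' specially, so this is C:\temp
$path = 'C:\temp';

// The source has four backslashes; the regex engine sees /\\/,
// which means "one literal backslash".
$found = preg_match('/\\\\/', $path);
print $found . "\n";

// Double quotes: PHP turns \t into a tab before the regex engine is
// involved, so this string no longer contains a backslash at all.
$mangled = "C:\temp";
$foundInMangled = preg_match('/\\\\/', $mangled);
print $foundInMangled . "\n";
```

Sticking to single-quoted strings for patterns keeps the number of layers you have to think about down to two.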

Expression Delimiters and Pattern Modifiers

So what are the two forward slashes doing here?


if ( preg_match('/yes/', $answer) ) {

They are the expression delimiters marking the start and end of the regular expression. In this example it’s not clear why you need them, but the purpose is to allow inclusion of pattern modifiers at the end of the expression. Pattern modifiers are “global instructions” to the regex engine, telling it to alter its default behaviour. I’ll look at pattern modifiers more soon but one example is the /i modifier, which tells the engine to perform case insensitive matching e.g.


if ( preg_match('/yes/i', $answer) ) {
    // etc.

By placing the /i pattern modifier at the end of the expression, I can now match both the strings ‘yes’ and ‘YES’ (and ‘YeS’ or other combinations of upper and lower case).
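To see the difference side by side, here’s a small sketch that runs a few made-up answers through both patterns (the $answers list is just sample data I invented):

```php
<?php
// Hypothetical sample answers, just to exercise the two patterns.
$answers = array('yes', 'YES', 'YeS', 'no');

foreach ($answers as $answer) {
    $sensitive   = preg_match('/yes/', $answer);   // default: case matters
    $insensitive = preg_match('/yes/i', $answer);  // i modifier: case ignored
    print "$answer: $sensitive $insensitive\n";
}
```

Only the all-lower-case ‘yes’ satisfies the first pattern, while the second accepts every spelling except ‘no’.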

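Note that the expression delimiter doesn’t have to be a forward slash – you can also use pretty much anything apart from a backslash, whitespace or an alpha-numeric character. You just need to make sure you use the same delimiter at each end of the pattern (bracket-style delimiters such as ( ) or { } are the exception: there the matching closing bracket ends the pattern). For example;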


if ( preg_match('%yes%i', $answer) ) {
    // etc.

I’ve used the ‘%’ character instead of a forward slash to delimit the expression. This can be useful when the pattern you want to search for contains the delimiter (common if you’re matching something like a URL or a file system path) – just change the delimiter, rather than having to escape characters within the expression (more on escaping another time).
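Here’s a rough sketch of that trade-off, using a file system path I made up for the purpose. Both patterns describe exactly the same text; the only difference is how much escaping the choice of delimiter forces on you.

```php
<?php
// Hypothetical text containing a path to search for.
$text = 'The binary lives in /usr/local/bin on this box';

// With / as the delimiter, every slash in the pattern needs escaping...
$escaped = preg_match('/\/usr\/local\/bin/', $text);

// ...while switching the delimiter to % leaves the pattern readable.
$readable = preg_match('%/usr/local/bin%', $text);

print "$escaped $readable\n";
```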

preg_match() return value

According to the PHP manual, preg_match() returns the number of times it was able to match the pattern you gave it (the first argument ‘/yes/’) against the string you are searching (the second argument $answer). So if it was unable to make any matches, it returns an integer 0, which will fail a PHP if condition. preg_match() also stops searching the moment it makes a first successful match, so will only ever return 1 at most. Now you might be wondering, if the result is either 0 or 1, why doesn’t the manual just say 0 or 1? The point it’s trying to convey is that preg_match() stops as soon as it finds a match – that can be important when you’re running regexes across large documents, where performance may be significant: if you want to check a document contains a word, and the word happens to be in the first paragraph, you don’t want the regex engine scanning the entire document when it’s already found a match.

Note that 0 and 1 aren’t the only returned values – if something goes wrong (like the pattern is not valid regex syntax), it will return FALSE (plus you’ll get a rude error warning) – make sure you check carefully if you’re generating patterns on the fly.
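Because 0 and FALSE both fail a plain if condition, telling “no match” apart from “broken pattern” needs a strict comparison. A minimal sketch, assuming a deliberately invalid pattern of my own invention:

```php
<?php
// A deliberately broken pattern: the [ is never closed.
$badPattern = '/[yes/';

// @ silences the warning so the sketch can focus on the return value.
$result = @preg_match($badPattern, 'yes');

if ($result === false) {
    print "pattern error\n";
} elseif ($result === 0) {
    print "no match\n";
} else {
    print "match\n";
}
```

In real code you’d probably want to log the failure rather than silence it, but the === comparison is the part that matters here.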

More on getting the actual matches out of preg_match() another time.

preg_match_all()

By contrast, preg_match_all() keeps on going until it’s examined the entire text you are searching. This can be illustrated with the following;


<?php

$answer1 = "no";
$answer2 = "yes";
$answer3 = "yes yes";

print preg_match('/yes/', $answer1)."\n";           // Displays '0'
print preg_match('/yes/', $answer2)."\n";           // Displays '1'
print preg_match('/yes/', $answer3)."\n";           // Displays '1'

print preg_match_all('/yes/', $answer1, $m)."\n";   // Displays '0'
print preg_match_all('/yes/', $answer2, $m)."\n";   // Displays '1'
print preg_match_all('/yes/', $answer3, $m)."\n";   // Displays '2'

More on preg_match() and preg_match_all() another time (such as how to get the matched text out of them).
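As a small taste of that, here’s a sketch of what ends up in the third argument once preg_match_all() has run. Treat it as a preview rather than the full story.

```php
<?php
$answer3 = "yes yes";

// The third argument is filled in with whatever was matched.
$count = preg_match_all('/yes/', $answer3, $m);

print $count . "\n";
print_r($m);   // $m[0] holds each piece of text the whole pattern matched
```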

Exact Match

Now so far, I’ve only been able to confirm that $answer contains ‘yes’ somewhere inside it. That means if the user provides an answer like ‘Bayesian spam filter’, it will pass my test. I really want to be 100% sure that the user said exactly ‘yes’ to the terms and conditions. So I need a little more pattern syntax, namely two meta-characters;


if ( preg_match('/^yes$/', $answer) ) {
    // etc.

The ^ meta-character means “assert that we match from the start of $answer” and the $ meta-character means “assert that we match to the end of $answer”. So what the expression is now saying is something like;

Match the word ‘yes’ but do not match anything else

Best not get hung up on the philosophical meaning of the term “meta-character” – just remember these two – ^ asserts the start of the string and $ asserts the end – combined they help you make exact matches against a complete string.
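A sketch comparing the unanchored and anchored patterns on a few invented answers. One wrinkle worth knowing: by default $ is also happy to match just before a newline that ends the string, so "yes\n" still passes. The D pattern modifier, which I haven’t introduced above, tells $ to match only at the very end.

```php
<?php
// Hypothetical answers a user might submit.
$tests = array('yes', 'Bayesian spam filter', 'yes please', "yes\n");

foreach ($tests as $answer) {
    $anywhere = preg_match('/yes/', $answer);      // substring match
    $exact    = preg_match('/^yes$/', $answer);    // anchored at both ends
    $strict   = preg_match('/^yes$/D', $answer);   // D: $ means the very end only
    print json_encode($answer) . ": $anywhere $exact $strict\n";
}
```

For a yes/no form field the trailing newline is rarely a problem, but it’s the kind of detail that matters once you’re validating things like usernames.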

A Fairy Tale

You could also use them separately. Another contrived example (this will be the last, I promise): you have a site where users can post fairy tales, and you want to make sure every story begins “Once upon a time”;


if ( !preg_match('/^Once upon a time/', $story) ) {
    die("This is not how a real fairy tale starts!\n");
}

Then to make sure they finish with “happily ever after”, you add…


if ( !preg_match('/happily ever after$/', $story) ) {
    die("Don't give me sob stories!\n");
}
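If you’d rather collect the complaints than die() on the first one, the two checks can be wrapped up along these lines. checkStory() is a hypothetical helper I’ve made up for the sketch, not a PHP function.

```php
<?php
// checkStory() is a hypothetical helper, not something from a library.
// It returns a list of complaints instead of calling die().
function checkStory($story) {
    $problems = array();
    if ( !preg_match('/^Once upon a time/', $story) ) {
        $problems[] = 'bad opening';
    }
    if ( !preg_match('/happily ever after$/', $story) ) {
        $problems[] = 'bad ending';
    }
    return $problems;
}

print_r(checkStory('Once upon a time they all lived happily ever after'));
print_r(checkStory('It was a dark and stormy night.'));
```

Note that a story ending ‘happily ever after.’ with a full stop would fail the second check, since the pattern insists the text runs right up to the end.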

More meta-characters in a moment.

A Note on Tactics

Now some regex masters can build giant expressions as a single pattern by hand. For the rest of us, a smarter approach is to keep expressions small, doing only a single task. Once regexes start to grow, they can become extremely hard to debug when they stop functioning as expected.

As the previous example illustrates, you can get a lot of mileage out of repeated smaller patterns, the downside being potential performance overhead, depending on what you’re doing, and extra lines of code.

If you do find your regexes growing, you can make them more readable using the /x pattern modifier, which allows you to split a regex across multiple lines and include comments – I’ll be illustrating that another time, as well as approaches that can help you process text with regexes in stages.
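Just so the /x modifier isn’t a complete mystery until then, here’s a minimal sketch using the ‘yes’ pattern from earlier, combined with /i. It’s far too small to need the treatment; the point is only to show the layout.

```php
<?php
// The same /^yes$/i pattern, spread over several lines.
// With the x modifier, whitespace in the pattern is ignored and
// anything from # to the end of the line is a comment.
$pattern = '/
    ^      # assert the start of the string
    yes    # the literal text to find
    $      # assert the end of the string
/xi';

print preg_match($pattern, 'YES') . "\n";
print preg_match($pattern, 'Bayesian spam filter') . "\n";
```

One catch: because whitespace is ignored, a pattern that needs to match a literal space has to escape it, and the comments mustn’t contain the delimiter character.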

Validating a username

Moving on to an example much nearer to home, one classic beginner’s mistake, when adding a user authentication system to a web app, is allowing users to choose whatever username they please when they register. Pretty soon some smart guy comes along and registers as something like ‘ admin’ (note the initial space character) and proceeds to make confusing posts all over your site, in the worst case exploiting poorly constructed code.

In general it’s a good idea to be very restrictive on key identifiers such as usernames, so this is a good opportunity to introduce a special kind of regex meta-character: the character class. In addition to the “built-in” meta-characters, such as the ^ and $ characters you’ve seen, you can also define your own meta-characters by using a character class, which is used to represent a single character. Jumping right to an example…


if ( !preg_match('/^[a-zA-Z0-9_]+$/', $username) ) {
    die("Invalid username: only letters, digits and underscores allowed.");
}

My character class here is [a-zA-Z0-9_] – it matches any character which obeys one of the following conditions;

  • It’s a lower case character between ‘a’ and ‘z’
  • … or it’s an upper case character between ‘A’ and ‘Z’
  • … or it’s a digit between ‘0’ and ‘9’
  • … or it’s just an ‘_’ underscore character.

The minus sign ‘-‘ which appears in the character class specifies a range and you’ll notice that between, say, ‘a’ and ‘z’ in the table of ASCII characters, you have all the lower case letters of the alphabet, nicely sorted.
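To watch the character class at work, here’s a sketch that runs a handful of invented sign-up attempts through the same pattern, including the ‘ admin’ trick:

```php
<?php
// Hypothetical sign-up attempts, including the ' admin' trick.
$usernames = array('harry_f', 'Admin42', ' admin', 'bob!', '');

foreach ($usernames as $username) {
    $ok = preg_match('/^[a-zA-Z0-9_]+$/', $username);
    print json_encode($username) . ($ok ? " accepted\n" : " rejected\n");
}
```

The first two are accepted; the leading space, the exclamation mark and the empty string all get rejected.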

Quantifying Length

You may also have noticed I sneaked in another meta-character into the last example – the + quantifier.

The + meta-character refers to the preceding character (or meta-character) in the pattern and modifies its meaning to “one or more of this character” – it quantifies its length. So my example…


if ( !preg_match('/^[a-zA-Z0-9_]+$/', $username) ) {
    // etc.

…requires that usernames are at least one character long but places no restriction on the maximum length. Now that’s actually not such a smart idea – usernames probably need to be at least 5 characters long to be readable and, given space limits in a VARCHAR column and screen resolutions, it’s probably wise to impose a maximum length; say 20 characters.

Instead of using the + quantifier, I can use a min/max quantifier that I define myself, using the curly brackets { };


if ( !preg_match('/^[a-zA-Z0-9_]{5,20}$/', $username) ) {
    // etc.

Just like the + quantifier, the min/max quantifiers apply to the preceding character (or meta-character) in the pattern.

And here you’re starting to see some of the power of regular expressions, over alternative approaches. My username check now looks at not just the contents of the username (which characters it contains) but also its length, with a single statement.

Just so you know, the min/max quantifiers allow you to do other length checks, depending on whether you omit the min or max e.g.


# Username must be _at least_ 5 characters long but no max limit...
if ( !preg_match('/^[a-zA-Z0-9_]{5,}$/', $username) ) {
    // etc.

…and…


# Username must be _exactly_ 10 characters long...
if ( !preg_match('/^[a-zA-Z0-9_]{10}$/', $username) ) {
    // etc.
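To compare the three quantifier forms, this sketch feeds them made-up usernames of 4, 5, 10 and 21 characters and prints a 1 or 0 for each pattern:

```php
<?php
// Hypothetical usernames of length 4, 5, 10 and 21.
$usernames = array('abcd', 'abcde', 'abcdefghij', str_repeat('a', 21));

foreach ($usernames as $username) {
    $range   = preg_match('/^[a-zA-Z0-9_]{5,20}$/', $username);  // 5 to 20
    $atLeast = preg_match('/^[a-zA-Z0-9_]{5,}$/', $username);    // 5 or more
    $exactly = preg_match('/^[a-zA-Z0-9_]{10}$/', $username);    // exactly 10
    print strlen($username) . " chars: $range $atLeast $exactly\n";
}
```

Only the 10 character name satisfies all three; the 21 character one gets past {5,} alone.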

OK – enough for part 1. Nothing too intense so far – more regex action next time…

  • WarpNacelle

    Excellent! Just what I needed. Very clear and just the right size. I’ll be back for the rest. Thank you!

  • Ben

    Excellent as usual!

  • random

    Great work Harry. I look forward to seeing whether you can coherently explain lookaround, recursive expressions etc. That’s where these tutorials tend to crash and burn…

  • Hendrik

    Great read. Clear and catchy as always. Just one remark: I used to have problems using ‘The Regex Coach’ as it isn’t that stable-had some freezes and odd behaviour. So I switched to Expresso (http://www.ultrapico.com/Expresso.htm). It’s free of charge. Though, a valid e-mail address has to be given for obtaining a license code.

  • http://www.phppatterns.com HarryF

    I look forward to seeing whether you can coherently explain lookaround, recursive expressions etc. That’s where these tutorials tend to crash and burn…

    I may well crash and burn there as well ;) We’ll see

    I used to have problems using ‘The Regex Coach’ as it isn’t that stable-had some freezes and odd behaviour. So I switched to Expresso (http://www.ultrapico.com/Expresso.htm). It’s free of charge. Though, a valid e-mail address has to be given for obtaining a license code.

    Thanks for the tip-off – have actually had similar problems with Regex Coach, which I’d assumed was because I was running it under Linux – it exploded on things like pasting in bigger blocks of sample text.

  • Amol kulkarni

    Really, it’s the best article for regex beginners. It encourages using regex instead of string manipulations and operations.

  • foofoonet

    Hi,

    We need a definite and solid regex resource to point new users to from the php forum – it’s fitting that user guide is on SP too.

    Glad to see you mentioned regex coach, and alternatives.

    In a similar vein, please consider this suggestion : a users guide to debugging a script from how to use echo, and print_r() to look inside variables… badly missing tutorial that.

  • http://www.floogy.com Madmac

    Brilliant. Haven’t learned new things yet, but this was a great refresher and I cannot wait for the next! As an aside, I totally agree with foofoonet – most of the calls for help I get from beginner PHP programmers could have been so easily solved with a little echo(), print_r() or die(mysql_error().$query);

  • http://www.deanclatworthy.com Dean C

    Great post for all the regular expression newbies :)!

    “without regular expressions, you can’t write a secure web application”

    I see what you’re trying to say but of course you can validate input without regular expressions, and of course you can have a secure application. It’s just a lot more difficult ;)

  • http://www.sudokumadness.com/ coffee_ninja

    The one regular expression that I would love to see is an email address validation regex which actually follows the RFC 2281 & 2282 :) All too many validation schemes forget that a “.” is a legal character in the mailbox name, not to mention even more obscure legalities like quotes and whitespace.

  • http://www.phppatterns.com HarryF

    The one regular expression that I would love to see is an email address validation regex which actually follows the RFC 2281 & 2282

    Are you sure you’ve got the right RFCs? RFC2281 is for Cisco Hot Standby router protocol and 2282 is also something networky. When I look at this Perl module it makes me think that truly validating an email address is something where you need more than regular expressions alone – this module uses full blown parsing, feeding in the grammar rules to validate.

  • http://www.genesisevolved.com Quaint

    Hey Harry!
    It would sure be nice to publish the RegExp tutorial above in a full article (instead of blog) once it’s finished! It certainly looks detailed and good enough to qualify and a good reg exp tutorial is definitely useful! Keep it up (as always :))

  • malikyte

    The regular expression that explicitly follows the RFC 2281 and 2282, is approximately 1 full page of text using courier new, 11pt font, 1″ margins, letter sized…just to give you an idea. It’s not a truly viable use for a REGEX. Running that REGEX would probably cause the parsing engine to choke.

  • malikyte

    Just to add to the above:
    There are many things you can do with REGEX; but just as many that you shouldn’t. Since REGEX is from Perl, I’ll take a lovely acronym commonly used in Perl programming: “TMTOWTDI”

  • http://www.phppatterns.com HarryF

    http://en.wikipedia.org/wiki/E-mail_address – think you both mean RFC 2821 and RFC 2822 (successors of RFC 822) not RFC 2281 and 2282.

  • malikyte

    Perhaps, I simply assumed it was the correct RFC (should be RFC822, the old standard). I’m not sure whether I’ve seen the REGEX for 2821 or 2822. My point still stands about the use of REGEX though.

    Here’s a more informative discussion on using a REGEX for email – linked to from a site marketing for a great Win/Linux tool on learning REGEX’s:
    http://www.regular-expressions.info/email.html

    …and the Perl module using it:
    http://ex-parrot.com/%7Epdw/Mail-RFC822-Address.html

  • Ize

    Great tutorial! I’m looking forward to seeing the follow-ups.

  • http://www.redcow.ca/ Ray Oliver

    Awesome! I too have been waiting for something like this, looking forward to reading this and other posts about PHP regex.

  • Andrei

    You can also download the slides for my Regex Clinic tutorial that I’ve been giving at conferences:

    http://www.gravitonic.com/do_download.php?download_file=talks/phptek-2006/regex-clinic_phptek2006.pdf

  • http://www.phppatterns.com HarryF

    Wow – outstanding talk Andrei – wish I’d been there. Don’t think this series will be going quite that far and I learnt some stuff reading that anyway.

  • Manuel

    Harry,

    you might have wanted to describe the limitations of regular expressions. Coming from a theoretical backround, I find it important to line out that you will never be able to parse anything that allows balanced bracket terms (as about all programming languages or HTML allows you to do).

    Those of you who are interested in a bit of theory might find the following wikipedia useful as a starting point: http://en.wikipedia.org/wiki/Chomsky_grammar

  • http://www.errewf.it RaS!

    Great article Harry, very well done and easy to learn. thanks!

  • http://en.journey.bg/portal.html 1magic

    can you please make these blogs printable… I realy would like to have it on hard copy

  • http://www.phppatterns.com HarryF

    you might have wanted to describe the limitations of regular expressions. Coming from a theoretical backround, I find it important to line out that you will never be able to parse anything that allows balanced bracket terms (as about all programming languages or HTML allows you to do).

    Will get there (hopefully) – it’s definitely not going as far as discussing Chomsky grammars and parsing theory but I will explore some practical tricks using regexes for simple lexing, good enough for a hand coded parser and the type of match/transform operations common to stuff like BBCode and wiki markup.

    There’s actually a dearth of useful discussion that bridges the gap between parsing theory and practice.

    Would be really interested to see this discussed more fully, for example. In addition to the points the author makes, there’s the practical usability concern that “syntax error: can’t display this page” is not acceptable for stuff like wiki markup – you have to allow room for users to make mistake and still display “something” so they can get some idea how to fix it.

  • http://www.phppatterns.com HarryF

    can you please make these blogs printable… I realy would like to have it on hard copy

    Try a print preview – at least in Firefox this CSS is stripping the two side panels and the top menu – seems like a serviceable print out to me.

  • Andrei

    Manuel,

    Harry,

    you might have wanted to describe the limitations of regular expressions. Coming from a theoretical backround, I find it important to line out that you will never be able to parse anything that allows balanced bracket terms (as about all programming languages or HTML allows you to do).

    You can certainly do that with the recursive matching feature in PCRE.

    \( ( (?>[^()]+) | (?R) )* \)

    Assuming extended option is set, this will match nested parentheses.

  • Huntington

    Does anyone know if it’s possible to use a regular expression with a PHP require statement? I use the require statement to pull in code from a .inc file into an .html file. I’d like to be able to pull in code from only a portion of the .inc file. For example, only the first three list items in the file: <li>one</li><li>two</li><li>three</li>

  • Bustergates

    Greetings: I have the crudest command of software authoring tools to create, edit, upload and publish content to a web site. I know the skills I’ve acquired by trial and error are pale next to those who actually have experience and a client list. In the event that the economy picks up and I can post a good enough looking resume to capture a job or an account as a web master, what should I do to insure I don’t stumble and lose these precious first assignment? Thanks.

  • SB6 Designz

    great work man

    way off topic but if anyone here has to do with the forum can you please activate my account because the email is not sending

  • DAVID

    THANK YOU VERY MUCH
    YOU HELPED ME A LOT

  • bahodir

    Say I want to have only one underscore (_) in a username. How do I do it?

  • emmilely

    Thank you so much, this was extremely helpful and very clearly stated. I am just learning about regular expressions, and this site is one of the very few that actually explained them in a manner beginners could understand.