Using PHP Regular Expressions

By Jason Pasnikowski

 

^[A-Za-z0-9-_.+%]+@[A-Za-z0-9-.]+\.[A-Za-z]{2,4}$

It makes all the sense of ancient Egyptian hieroglyphics to you, although those little pictures at least look like they have meaning. But this… this looks like gibberish. What does it mean? It means oleomarg23@hotmail.com, Fiery.Rebel@veneuser.info, robustlamp+selfmag@gmail.ca, or nearly any other simple email address, because this is a pattern written in a language that describes how to match text in strings. When you’re looking to go beyond straight text matches, like finding “stud” in “Mustard” (which would fail, by the way), and you need a way to “explain” what you’re looking for because each instance may be different, you’ve come to need Regular Expressions, affectionately called regex.

Intro to Regex Notation

To get your feet wet, let’s take the above example and break it down piece by piece.

^[A-Za-z0-9-_.+%]+@[A-Za-z0-9-.]+\.[A-Za-z]{2,4}$
The leading ^ is an anchor. The beginning of a string isn’t a character, not even an invisible one like a carriage return, but the regex engine can still detect it. Using ^ tells the engine that the match must start at the very beginning of the text (PHP treats the whole string as a single line unless told otherwise).

^[A-Za-z0-9-_.+%]+@[A-Za-z0-9-.]+\.[A-Za-z]{2,4}$
Instead of specifying each and every character of the alphabet, we have a shorthand that gives a range: A-Z and a-z. Matching is case sensitive by default, so you have to specify both an uppercase and a lowercase range.

^[A-Za-z0-9-_.+%]+@[A-Za-z0-9-.]+\.[A-Za-z]{2,4}$
The same goes for numbers; 0-9 shortens them to a range instead of writing out all 10 digits.

^[A-Za-z0-9-_.+%]+@[A-Za-z0-9-.]+\.[A-Za-z]{2,4}$
After the ranges come the special characters we’re allowing: a dash, underscore, dot, plus-sign, and percent-sign (-_.+%).

^[A-Za-z0-9-_.+%]+@[A-Za-z0-9-.]+\.[A-Za-z]{2,4}$
The brackets surrounding our ranges take everything you’ve put between them and create your own custom wildcard that matches exactly one character. Our “wildcard” is capable of matching any letter A-Z in either uppercase or lowercase, a digit 0-9, or one of our special punctuation characters.
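As a small, hedged illustration of the bracketed set on its own, the sketch below checks single characters against it with preg_match(), which the article covers further down. The helper name isAllowedLocalChar() is made up for this example and is not part of the article’s code.

```php
<?php
// Hypothetical helper: true when the one-character string $char is in the set.
function isAllowedLocalChar(string $char): bool
{
    // The bracketed set matches exactly one character from the listed
    // ranges and punctuation.
    return preg_match('/[A-Za-z0-9-_.+%]/', $char) === 1;
}

var_dump(isAllowedLocalChar('q')); // a letter from a-z
var_dump(isAllowedLocalChar('7')); // a digit from 0-9
var_dump(isAllowedLocalChar('+')); // one of the listed punctuation characters
var_dump(isAllowedLocalChar('@')); // not in the set
```

The first three calls should report true and the last one false, since the at-sign is not listed between the brackets.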

^[A-Za-z0-9-_.+%]+@[A-Za-z0-9-.]+\.[A-Za-z]{2,4}$
The + right after the closing bracket is a quantifier; it modifies how many times the previous item should match, and in this case the previous item is the set within brackets. + means “at least one,” so in our example, after matching the beginning of the string, we have to have at least one of the characters within the brackets.

At this point we can match (given the sample email addresses from earlier) oleomarg23, Fiery.Rebel, and robustlamp+selfmag. Something like @SodaCanDrive.com would fail because we must have at least one of the characters in the bracketed set at the very beginning of the text.

In addition to + as a quantifier, there is *, which is almost identical except that it also succeeds when the previous item doesn’t occur at all. If we replaced the first + quantifier in the sample with * and had this:

^[A-Za-z0-9-_.+%]*@[A-Za-z0-9-.]+\.[A-Za-z]{2,4}$

it would have successfully matched the string @SodaCanDrive.com, as we are effectively telling the regex engine to keep matching until it comes across a character not in the set, even if there aren’t any characters to match.
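To see the difference between the two quantifiers in isolation, here is a minimal sketch that trims the pattern down to the part up to the at-sign. The variable names are only for illustration, and preg_match() is the PHP function introduced later in the article.

```php
<?php
// Trimmed-down versions of the article's pattern, ending at the at-sign.
$plusPattern = '/^[A-Za-z0-9-_.+%]+@/'; // at least one character before the @
$starPattern = '/^[A-Za-z0-9-_.+%]*@/'; // zero or more characters before the @

$subject = '@SodaCanDrive.com';

var_dump(preg_match($plusPattern, $subject)); // int(0): nothing precedes the @
var_dump(preg_match($starPattern, $subject)); // int(1): zero characters is acceptable
var_dump(preg_match($plusPattern, 'Fiery.Rebel@veneuser.info')); // int(1)
```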

Back to our original pattern…

^[A-Za-z0-9-_.+%]+@[A-Za-z0-9-.]+\.[A-Za-z]{2,4}$
The @ matches literally, so now we’ve matched oleomarg23@, Fiery.Rebel@, and robustlamp+selfmag@. The text greencandelabra.com fails because it doesn’t have an at-sign!

^[A-Za-z0-9-_.+%]+@[A-Za-z0-9-.]+\.[A-Za-z]{2,4}$
The second bracketed set, [A-Za-z0-9-.]+, is similar to what we matched before the at-sign except this time we’re not allowing the underscore, plus-sign, or percent-sign. Now we’re up to oleomarg23@hotmail.com, Fiery.Rebel@veneuser.info and robustlamp+selfmag@gmail.ca. gnargly3.1415@pie_a_la_mode.com would only match up to gnargly3.1415@pie.

^[A-Za-z0-9-_.+%]+@[A-Za-z0-9-.]+\.[A-Za-z]{2,4}$
Here we have an escaped dot (\.) so as to match it literally. Note the plus-sign matched literally when it was inside brackets, but outside it had special meaning as a quantifier. Outside the brackets, the dot has to be escaped or it will be treated as the wildcard character; inside the brackets, a dot means a dot.

Uh oh! Since we already matched the .com, .info and .ca, it would seem like the match would fail because we don’t have any more dots. But regex is smart: the matching engine tries backtracking to find the match. So now we’re back to oleomarg23@hotmail., Fiery.Rebel@veneuser. and robustlamp+selfmag@gmail..

At this point, gnargly3.1415@pie_a_la_mode.com fails because the character after what’s matching so far is not a dot. drnddog@chewwed.legs.onchair continues as drnddog@chewwed.legs..
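The escaping rule for the dot can be checked with a deliberately tiny pattern. This is only a sketch; the three patterns and the test strings are invented for the illustration.

```php
<?php
$unescaped = '/a.c/';   // bare dot: a wildcard for any one character except a newline
$escaped   = '/a\.c/';  // backslash-dot: a literal dot only
$inSet     = '/a[.]c/'; // inside brackets the dot is already literal

foreach (['a.c', 'abc'] as $subject) {
    printf("%s: unescaped=%d escaped=%d inSet=%d\n", $subject,
        preg_match($unescaped, $subject),
        preg_match($escaped, $subject),
        preg_match($inSet, $subject));
}
```

All three patterns accept a.c, but only the unescaped one also accepts abc, which is why the dot before the final letters of the email pattern needs its backslash.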

^[A-Za-z0-9-_.+%]+@[A-Za-z0-9-.]+\.[A-Za-z]{2,4}$
Remember how we made our own custom wildcard using brackets? We can do a similar thing with braces to make custom quantifiers. {2,4} means to match at least two times but no more than four times. If we only wanted to match exactly two times, we would write {2}. We could handle any quantity up to a maximum of four with {0,4}. {2,} would match a minimum of two.
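The four brace forms can be compared side by side. In this sketch the helper firstMatch() is hypothetical (it is not from the article) and simply returns whatever text the pattern matched, using the optional third argument of preg_match().

```php
<?php
// Hypothetical helper: the text matched by $pattern, or null if nothing matched.
function firstMatch(string $pattern, string $subject): ?string
{
    return preg_match($pattern, $subject, $m) === 1 ? $m[0] : null;
}

var_dump(firstMatch('/[a-z]{2,4}/', 'abcdef')); // "abcd": stops at the maximum of four
var_dump(firstMatch('/[a-z]{2}/', 'abcdef'));   // "ab": exactly two
var_dump(firstMatch('/[a-z]{0,4}/', 'abcdef')); // "abcd": anything up to four
var_dump(firstMatch('/[a-z]{2,}/', 'abcdef'));  // "abcdef": two or more, no upper limit
var_dump(firstMatch('/[a-z]{2,4}/', 'a'));      // NULL: fewer than the minimum of two
```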

{2,4} is our special quantifier that limits the last bracketed set to any 2, 3, or 4 letters. We’ve nearly fully matched oleomarg23@hotmail.com, Fiery.Rebel@veneuser.info and robustlamp+selfmag@gmail.ca. drnddog@chewwed.legs.onchair only gets as far as drnddog@chewwed.legs.onch, because the quantifier stops after four letters.

We just have one more to go…

^[A-Za-z0-9-_.+%]+@[A-Za-z0-9-.]+\.[A-Za-z]{2,4}$
$ is the counterpart to ^. Just as ^ does for the start of the line, $ anchors the match to the end of the line. Our examples all match now, and drnddog@chewwed.legs.onchair fails because there aren’t 2, 3, or 4 letters preceded by a dot at the end of the string.
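The walkthrough above can be replayed in PHP by running the finished pattern against every sample string mentioned so far. This is a minimal sketch: the $samples array is assembled here from the article’s examples, and the slashes around the pattern are the delimiters PHP requires.

```php
<?php
$pattern = '/^[A-Za-z0-9-_.+%]+@[A-Za-z0-9-.]+\.[A-Za-z]{2,4}$/';

$samples = [
    'oleomarg23@hotmail.com',          // match
    'Fiery.Rebel@veneuser.info',       // match
    'robustlamp+selfmag@gmail.ca',     // match
    '@SodaCanDrive.com',               // nothing before the @
    'greencandelabra.com',             // no @ at all
    'gnargly3.1415@pie_a_la_mode.com', // underscore not allowed after the @
    'drnddog@chewwed.legs.onchair',    // last part longer than four letters
];

foreach ($samples as $sample) {
    $verdict = preg_match($pattern, $sample) === 1 ? 'match' : 'no match';
    echo $sample . ' => ' . $verdict . "\n";
}
```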

Regexes in PHP

It’s all well and good to have this basic understanding of the notation used by regular expressions, but we still need to know how to apply it in the context of PHP to actually do something productive, so let’s look at the functions preg_match(), preg_replace() and preg_match_all().

preg_match()

To validate a form field for an email address, we’d use preg_match():

<?php
if (preg_match('/^[A-Za-z0-9-_.+%]+@[A-Za-z0-9-.]+\.[A-Za-z]{2,4}$/',
    $_POST["emailAddy"])) {
    echo "Email address accepted";
}
else {
    echo "Email address is all broke.";
}

If a match is found, preg_match() returns 1, otherwise 0 (or false if an error occurs). Notice that we added slashes to the beginning and end of the regex. These are used as delimiters to show the function where the regular expression begins and ends. You may ask, “But Jason, isn’t that what the quotes are for?” Let me assure you that there is more to it, as I will explain shortly.
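One way to make the check reusable, offered only as a sketch, is to wrap it in a function. The name looksLikeEmail() and the empty-string fallback for a missing form field are assumptions added for this example, not part of the original code.

```php
<?php
// Hypothetical wrapper around the article's check so it can be reused and tested.
function looksLikeEmail(string $candidate): bool
{
    $pattern = '/^[A-Za-z0-9-_.+%]+@[A-Za-z0-9-.]+\.[A-Za-z]{2,4}$/';

    // preg_match() gives 1 for a match, 0 for no match, and false on error;
    // comparing with === 1 treats both 0 and false as "not accepted".
    return preg_match($pattern, $candidate) === 1;
}

// Assumption: fall back to an empty string when the form field is absent,
// for instance when the script runs from the command line.
$submitted = $_POST['emailAddy'] ?? '';

echo looksLikeEmail($submitted)
    ? "Email address accepted"
    : "Email address is all broke.";
```

For production validation, readers in the comments below recommend PHP’s built-in filter_var($email, FILTER_VALIDATE_EMAIL) over a hand-written pattern.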

preg_replace()

To find an email address and add formatting, we would use preg_replace():

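<?php
// Single quotes keep \1 intact; inside double quotes PHP would read "\1" as an octal escape.
$formattedBlock = preg_replace(
    '/([A-Za-z0-9-_.+%]+@[A-Za-z0-9-.]+\.[A-Za-z]{2,4})/',
    '<b>\1</b>', $blockOText);

Here’s that explanation that was promised: the delimiters mark where the pattern stops so that modifier flags can follow the closing one. One such flag is U, written as /pattern/U. We’ve seen how regex matches are greedy, gobbling up as many characters as they can and only backtracking if they have to. U makes the quantifiers “un-greedy”: each one matches as few characters as it can and extends only when the rest of the pattern forces it to. That is exactly why U is left off the pattern above. The pattern isn’t anchored with $ here, so nothing would force the final {2,4} past its minimum of two letters, and with U the string tweedle@dee.com-and-tweedle@dum.com would yield tweedle@dee.co instead of tweedle@dee.com. Greediness can’t make the two addresses run together into one match either, since neither bracketed set allows an @.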
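The effect of the U flag is easiest to see on a pattern that uses the dot wildcard. The sketch below is an illustration on made-up text, not the article’s email pattern; \/ is simply an escaped slash, needed because the slash is also the delimiter.

```php
<?php
$html = '<b>one</b> and <b>two</b>';

// Default (greedy): .+ grabs as much as it can, then backtracks to the last </b>.
preg_match('/<b>.+<\/b>/', $html, $greedy);

// With U (un-greedy): .+ grabs as little as it can, stopping at the first </b>.
preg_match('/<b>.+<\/b>/U', $html, $lazy);

echo $greedy[0] . "\n"; // <b>one</b> and <b>two</b>
echo $lazy[0] . "\n";   // <b>one</b>
```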

Did you notice we also wrapped the whole expression in parentheses? This causes the regex engine to capture a copy of the text that matches the expression between the parentheses, which we can reference with a back-reference (\1). The second argument to preg_replace() is telling the function to replace the text with an opening bold tag, whatever matched the pattern between the first set of parentheses, and a closing bold tag. If there were other sets of parentheses, they could be referenced with \2, \3, etc. depending on their position.
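To show more than one back-reference at work, this sketch splits the pattern into two sets of parentheses. The sample sentence and the choice of bold and italic tags are invented for the illustration.

```php
<?php
$text = 'Write to tweedle@dee.com today.';

// Group 1 captures the part before the @, group 2 captures the domain.
$pattern = '/([A-Za-z0-9-_.+%]+)@([A-Za-z0-9-.]+\.[A-Za-z]{2,4})/';

// Single quotes pass \1 and \2 through unchanged for preg_replace() to interpret.
$formatted = preg_replace($pattern, '<b>\1</b> at <i>\2</i>', $text);

echo $formatted . "\n"; // Write to <b>tweedle</b> at <i>dee.com</i> today.
```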

preg_match_all()

To scan some text and extract an array of all the email addresses found in it, preg_match_all() is our best choice:

<?php
$matchesFound = preg_match_all(
    '/([a-z0-9-_.+%]+@[a-z0-9-.]+\.[a-z]{2,4})/i',
    $articleWithEmailAddys, $listOfEmails);
if ($matchesFound) {
    foreach ($listOfEmails[0] as $foundEmail) {
        echo $foundEmail . "<br>";
    }
}

preg_match_all() returns how many matches it found, and sticks those matches into the variable we supplied by reference as the third argument. It actually creates a multi-dimensional array: index 0 holds the text of every full match, and index 1 holds what the first set of parentheses captured (the same text here, since the parentheses wrap the whole pattern).

This time the flag after the closing delimiter is i, which instructs the regex engine that we want the pattern to be applied in a case-insensitive manner. That is, /a/i would match both a lower-case A and an upper-case A (or /A/i would work equally well for that matter since the modifier is asking the engine to be case-agnostic). This allows us to write things like [a-z0-9] in our expression now instead of [A-Za-z0-9], which makes it a little shorter and easier to grok.
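The shape of the array that preg_match_all() fills in is easier to see with two capture groups. This is a sketch on an invented sentence; it uses only the i modifier and splits the pattern into a group for the part before the @ and a group for the domain.

```php
<?php
$articleWithEmailAddys =
    'Ask Oleomarg23@Hotmail.com or robustlamp+selfmag@gmail.ca for details.';

$matchesFound = preg_match_all(
    '/([a-z0-9-_.+%]+)@([a-z0-9-.]+\.[a-z]{2,4})/i',
    $articleWithEmailAddys, $listOfEmails);

echo $matchesFound . "\n";  // how many full matches were found
print_r($listOfEmails[0]);  // the full matches
print_r($listOfEmails[1]);  // what the first set of parentheses captured
print_r($listOfEmails[2]);  // what the second set of parentheses captured
```

Thanks to the i modifier, the mixed-case address is found even though the pattern lists only lowercase ranges.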

Well that about wraps things up. While there is a lot more you can do using regular expressions involving look-ahead, look-behind, and more intricate examples of back-references, all of which you can find in PHP’s online documentation, hopefully you have plenty to work with that will serve you for many scripts just from this article.

Image via Boris Mrdja / Shutterstock


  • Igor

    Very great explanation. Brilliant!
    Thanks.

  • Dave H

    Nice introduction but here’s a few points:

    “.” in a regex matches any single character so it needs to be escaped like this: “\.”.

    I guess it’s just an example but “{2,4}” is not enough to match all TLDs, such as .museum and .travel (not that many people use them) and “[A-Za-z0-9-]” is not enough to match domains such as £.com.

    Due to the complexities of email addresses, instead of creating your own regex for emails in PHP, you should use filter_var($email, FILTER_VALIDATE_EMAIL)

    • http://zaemis.blogspot.com Timothy Boronczyk

      Indeed the dot generally matches almost any character (it is shorthand for [^\r\n]), so for it to match literally outside of a range then yes it must be escaped. But within a range (inside brackets), the semantics of many special characters are changed. The dot will match a literal dot when it is used within brackets so it doesn’t need escaping there.

      • Dave H

        I never knew that. Thanks. It seems as though the only metacharacters inside a range are “]”, “\”, “^” and “-”.

    • http://WebsiteURL Jason Pasnikowski

      I was aware of the limitations to this regex, but it still seemed a good, sufficiently-practical example to work with for helping those trying to understand rexeg. A more thorough regex was considered, but left alone for simplicity. However, it is good that you mention the caveats so everyone is aware that this is not a be-all solution, so thank you! :-)

  • http://gilbert.im/ Gilberto Ramos

    Excellent! =)
    Clear explanation.

  • Brian

    A great explanation but it would reject .museum and .travel, which are both current top-level domains.
    Also, totally different expressions are needed for addresses in the Internationalized country code top-level domains being introduced, such as Algeria, China, etc.
    Brian

  • Nikita Popov

    I’m all in favor of making the people less ignorant about regular expressions – but why do you try to teach them with the worst example of them all? **Email addresses are hard**. Your regex will fail on oh-so-many valid emails. See http://www.regular-expressions.info/email.html for a discussion on this topic.

    Next time, recommend your readers to use filter_var instead, please!

    • Richard Quadling

      That site is maintained by the author of RegexBuddy and RegexMagic and is the author of http://shop.oreilly.com/product/9780596520694.do

      I thoroughly recommend this book. It _is_ a cookbook. So, you have a task to complete using Regex, you look up that task in the index and find your “recipe”.

      And I do have to agree that the regex for email addresses is notoriously difficult. Adding to the comments already made about .travel and .museum, you are going to have all the unicode domain names to deal with. And then you also have punycode : http://davidmichaelthompson.com/2010/04/13/domain-names-unicode-punycode/

      Having said all of that, the use of regex in PHP is pretty fundamental stuff. And using the PCRE (preg_xxx) rather than the POSIX (ereg_xxx) is the right thing too. The ereg extension is not binary safe and is a deprecated extension as of PHP 5.3.0

  • Randall Stewart

    This always trips me up. So here is a summary in case it helps anyone else. JavaScript can, to a variable, assign a string *OR* a regex, depending on the delimiter:
    var str = "Thanks Jason!"
    var re = /[A-Za-z!]+/

    But PHP requires you to define a regex *AS* a string (using both delimiters):
    $re = "/[A-Za-z!]+/"
    As Jason shows in his examples, In PHP this “dual delimiter” applies whether you’re assigning the regex to a variable, then using the variable as a function parameter, or using the regex-as-string directly in the function call.

  • Laura

    You’ve managed to completely demystify RegEx in 5 minutes. Brill!

  • Richard Quadling

    I’ve been a long time user of a product called RegexBuddy (www.regexbuddy.com). It is a windows app that allows you to design your regex using a point and click interface. It covers a LOT of the different flavours of regex (not everyone is using PCRE) and it can generate language specific code from the regex.

    But the greatest feature for me is to be able to describe a regex in relatively simple English. The regex /^[A-Za-z0-9-_.+%]+@[A-Za-z0-9-.]+\.[A-Za-z]{2,4}$/ gets explained as …
    Options: case insensitive; ^ and $ match at line breaks
    Assert position at the beginning of a line (at beginning of the string or after a line break character) «^»
    Match a single character present in the list below «[A-Za-z0-9-_.+%]+»
    Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
    A character in the range between “A” and “Z” «A-Z»
    A character in the range between “a” and “z” «a-z»
    A character in the range between “0” and “9” «0-9»
    One of the characters “-_.+%” «-_.+%»
    Match the character “@” literally «@»
    Match a single character present in the list below «[A-Za-z0-9-.]+»
    Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
    A character in the range between “A” and “Z” «A-Z»
    A character in the range between “a” and “z” «a-z»
    A character in the range between “0” and “9” «0-9»
    One of the characters “-.” «-.»
    Match the character “.” literally «\.»
    Match a single character present in the list below «[A-Za-z]{2,4}»
    Between 2 and 4 times, as many times as possible, giving back as needed (greedy) «{2,4}»
    A character in the range between “A” and “Z” «A-Z»
    A character in the range between “a” and “z” «a-z»
    Assert position at the end of a line (at the end of the string or before a line break character) «$»

    I’m hoping that comes through here. If not, take a look at http://pastebin.com/qQWtrzzy for the same content.

    I’m not affiliated with RegexBuddy or anything like that. Just a happy user of the tools.

    Regards,

    Richard Quadling.

  • Inori

    Thanks!

  • http://unobtrusive-javascript-applications.blogspot.com/ Joan

    You might be interested in using this free online tool to test regular expressions. Easy to use and understand:
    http://www.gskinner.com/RegExr/

  • Julien

    Very well explained, step by step, thank you!
    NB: There’s a typo in your last snippet: $foundEmail becomes $found_email inside the foreach block.

    • http://zaemis.blogspot.com Timothy Boronczyk

      I’ve fixed the article’s example. Thanks for the good catch!

  • http://timwahrendorff.de Tim

    OT, but related to the article:

    2 $matchesFound = preg_match_all(
    3 '/([a-z0-9-_.+%]+@[a-z0-9-.]+\.[a-z]{2,4})/Ui',
    4 $articleWithEmailAddys, $listOfEmails);
    5 if ($matchesFound)
    […]

    ^^ although this will work, it is very bad style you shouldn’t show beginners and shouldn’t do when advanced. Please write if($matchesFound>0) or similar to keep the example code consistent.

  • http://pixopoint.com/ Ryan Hellyer

    Awesome tutorial. Thanks for the lowdown on regex :)

  • http://www.chrispoole.net Chris

    Great little tutorial for those if us still getting to grips with such things.

  • Alan Rew

    Due to the (unnecessary?) dot in the second set of square brackets, this expression allows email addresses like
    fred@joe…….com
    which looks odd – is this a valid email address? Just asking.

  • Mal Curtis

    Sucks if you’re Irish!

    I’d recommend making sure your email regex allows apostrophe’s in the pre @ part.

  • John Mulligan

    Between the article and all the additional comments, I’m quickly getting a handle on this regex stuff. THanks!

  • http://Lukasarts.info Lukas

    Great explanation. PHP manual should have this article ;)

  • Jon

    Excellent explanation. I wish I had read this years ago.
