By Jason Pasnikowski

Using PHP Regular Expressions



^[A-Za-z0-9-_.+%]+@[A-Za-z0-9-.]+\.[A-Za-z]{2,4}$

It makes all the sense of ancient Egyptian hieroglyphics to you, although those little pictures at least look like they have meaning. But this… this looks like gibberish. What does it mean? It means oleomarg23@hotmail.com, Fiery.Rebel@veneuser.info, robustlamp+selfmag@gmail.ca, or nearly any other simple email address, because this is a pattern written in a language that describes how to match text in strings. When you’re looking to go beyond straight text matches, like finding “stud” in “Mustard” (which would fail, by the way), and you need a way to “explain” what you’re looking for because each instance may be different, you’ve come to need Regular Expressions, affectionately called regex.

Intro to Regex Notation

To get your feet wet, let’s take the above example and break it down piece by piece.

The beginning of a line can be detected in much the same way as a carriage return, even though it isn’t really a character at all. Using ^ tells the regex engine that the match must start at the beginning of the line.

Instead of specifying each and every character of the alphabet, we have a shorthand that gives a range: A-Za-z. Ranges are usually case sensitive, so you have to specify both an uppercase and a lowercase range.

The same goes for numbers; we can simply shorten them to the range 0-9 instead of writing all 10 digits.

These are the special characters we’re allowing, written -_.+% in the pattern: a dash, underscore, dot, plus-sign, and percent-sign.

The brackets, [ and ], take everything you’ve put between them and turn it into your own custom wildcard. Our “wildcard” is capable of matching any letter A-Z in either uppercase or lowercase, a digit 0-9, or one of our special punctuation characters.

The + after the closing bracket is a quantifier; it modifies how many times the previous character should match. In this case, the previous character is the set within brackets. + means “at least one,” so in our example, after matching the beginning of the string, we have to have at least one of the characters within the brackets.

At this point we can match (given the sample email addresses from earlier) oleomarg23, Fiery.Rebel, and robustlamp+selfmag. A string that starts with a character outside the set, such as the at-sign, would fail because we must have at least one of the characters in the bracketed set at the very beginning of the text.
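As a quick check of the pieces covered so far, the sketch below applies only the anchor, the bracketed set, and the + quantifier, and records how much of each string they consume. The input addresses and the variable names are made up for illustration; only the local parts come from the samples above.

```php
<?php
// Only the first piece of the pattern: ^ anchor, bracketed set, + quantifier.
$startPattern = '/^[A-Za-z0-9-_.+%]+/';

// Hypothetical inputs (the example.com domain is an assumption).
$inputs = [
    'oleomarg23@example.com',
    'robustlamp+selfmag@example.com',
    '@example.com',
];

$matchedStarts = [];
foreach ($inputs as $input) {
    // When preg_match() returns 1, $found[0] holds the text the pattern matched.
    $matchedStarts[$input] = preg_match($startPattern, $input, $found) ? $found[0] : null;
}

print_r($matchedStarts);
```

The set stops consuming at the at-sign because @ is not one of the characters between the brackets, and the last input fails outright because nothing from the set sits at the start.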

In addition to + as a quantifier, there is *, which is almost identical except that it also matches when there are no matching characters at all. If we replaced the first + quantifier in the sample with * and had this:

^[A-Za-z0-9-_.+%]*@[A-Za-z0-9-.]+\.[A-Za-z]{2,4}$

it would successfully match a string with nothing before the at-sign, as we are effectively telling the regex engine to keep matching until it comes across a character not in the set, even if there aren’t any characters from the set to match.
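To see the difference between the two quantifiers side by side, here is a minimal sketch that runs the opening piece of the pattern with + and then with * against the same input. The input string and variable names are assumptions chosen for the demonstration.

```php
<?php
// Same opening piece, once with + (one or more) and once with * (zero or more).
$withPlus = '/^[A-Za-z0-9-_.+%]+@/';
$withStar = '/^[A-Za-z0-9-_.+%]*@/';

// Hypothetical input with nothing before the at-sign.
$input = '@example.com';

$plusResult = preg_match($withPlus, $input); // + needs at least one character from the set
$starResult = preg_match($withStar, $input); // * is satisfied by zero characters

var_dump($plusResult, $starResult);
```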

Back to our original pattern…

The @ matches literally, so now we’ve matched oleomarg23@, Fiery.Rebel@, and robustlamp+selfmag@. Any text without an at-sign fails at this point.

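This portion of the expression, [A-Za-z0-9-.]+, is similar to what we matched before the at-sign, except this time we’re not allowing the underscore, plus-sign, or percent-sign. Now we’ve matched each sample address all the way through its .com, .info, or .ca, while an address that continues past gnarly3.1415@pie with a character outside this set would only match up to gnarly3.1415@pie.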

Here we have an escaped dot, \., so as to match it literally. Note the plus-sign matched literally when it was inside brackets, but outside it had special meaning as a quantifier. Outside the brackets, the dot has to be escaped or it will be treated as the wildcard character; inside the brackets, a dot means a dot.
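The three spellings of the dot behave differently enough that a small comparison may help. This sketch uses throwaway patterns and subjects (all hypothetical, not taken from the email pattern) to show which forms treat the dot as a wildcard.

```php
<?php
// Three ways to write "a, dot, c"; only the first treats the dot as a wildcard.
$patterns = [
    'unescaped' => '/^a.c$/',   // dot matches any single character except a newline
    'escaped'   => '/^a\.c$/',  // the backslash makes the dot literal
    'bracketed' => '/^a[.]c$/', // inside brackets the dot is already literal
];

$results = [];
foreach ($patterns as $name => $pattern) {
    foreach (['a.c', 'abc'] as $subject) {
        $results[$name][$subject] = preg_match($pattern, $subject);
    }
}

print_r($results);
```

Only the unescaped form accepts abc, which is why a missing backslash before the dot quietly makes an email pattern more permissive than intended.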

Uh oh! Since we already matched the .com, .info and .ca, it would seem like the match would fail because we don’t have any more dots. But regex is smart: the matching engine tries backtracking to find the match. So now we’re back to oleomarg23@hotmail., Fiery.Rebel@veneuser. and robustlamp+selfmag@gmail..

At this point, the address that only matched up to gnarly3.1415@pie fails because the character after what’s been matched so far is not a dot, while drnddog@chewwed.legs.onchair continues with drnddog@chewwed.legs. matched.

Remember how we made our own custom wildcard using brackets? We can do a similar thing with braces to make custom quantifiers. {2,4} means to match at least two times but no more than four times. If we only wanted to match exactly two times, we would write {2}. We could handle any quantity up to a maximum of four with {0,4}. {2,} would match a minimum of two.
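The four brace forms can be tried out with a small helper. The function name fitsQuantifier() and the lowercase-only set are assumptions made for this sketch; the quantifiers themselves are the ones described above.

```php
<?php
// Hypothetical helper: true when $subject consists solely of lowercase letters
// repeated as many times as $quantifier allows.
function fitsQuantifier(string $quantifier, string $subject): bool
{
    return preg_match('/^[a-z]' . $quantifier . '$/', $subject) === 1;
}

$tooShort  = fitsQuantifier('{2,4}', 'a');        // below the minimum of two
$inRange   = fitsQuantifier('{2,4}', 'abcd');     // four letters is the maximum
$tooLong   = fitsQuantifier('{2,4}', 'abcde');    // five letters is one too many
$exactly   = fitsQuantifier('{2}', 'abc');        // {2} means exactly two
$upToFour  = fitsQuantifier('{0,4}', '');         // zero repetitions are allowed
$twoOrMore = fitsQuantifier('{2,}', 'abcdefgh');  // no upper limit

var_dump($tooShort, $inRange, $tooLong, $exactly, $upToFour, $twoOrMore);
```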

{2,4} is our special quantifier that limits the last set, [A-Za-z], to any 2, 3, or 4 letters. We’ve nearly fully matched, and drnddog@chewwed.legs.onchair has to go back further to drnddog@chewwed.legs to make the match.

We just have one more to go…

$ is the counterpart to ^. Just as ^ does for the start of the line, $ anchors the match to the end of the line. Our examples all match now, and drnddog@chewwed.legs.onchair fails because there aren’t 2, 3, or 4 letters preceded by a dot at the end of the string.
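To watch the backtracking settle, the sketch below runs the complete pattern with one addition that is not part of the original: parentheses around each piece, so preg_match() reports which text every piece ended up with. The first address is hypothetical; the second is the failing sample from the walkthrough.

```php
<?php
// The full pattern, with parentheses added around each piece (an addition for
// this demonstration) so we can see how the text was divided after backtracking.
$pattern = '/^([A-Za-z0-9-_.+%]+)@([A-Za-z0-9-.]+)\.([A-Za-z]{2,4})$/';

// Hypothetical address with more than one dot after the at-sign.
$matched = preg_match($pattern, 'someone@mail.example.com', $pieces);

// Sample whose last segment is longer than four letters.
$rejected = preg_match($pattern, 'drnddog@chewwed.legs.onchair');

print_r($pieces);
var_dump($rejected);
```

The greedy set after the at-sign first swallows mail.example.com in full, then gives characters back until the escaped dot and the two-to-four letters can both be satisfied before the end of the line.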

Regexes in PHP

It’s all well and good to have this basic understanding of the notation used by regular expressions, but we still need to know how to apply it in the context of PHP to actually do something productive, so let’s look at the functions preg_match(), preg_replace() and preg_match_all().


To validate a form field for an email address, we’d use preg_match():

// preg_match() returns 1 when the submitted value matches the pattern
if (preg_match('/^[A-Za-z0-9-_.+%]+@[A-Za-z0-9-.]+\.[A-Za-z]{2,4}$/',
    $_POST["emailAddy"])) {
    echo "Email address accepted";
} else {
    echo "Email address is all broke.";
}

If a match is found, preg_match() returns 1, otherwise 0. Notice that we added slashes to the beginning and end of the regex. These are used as delimiters to show the function where the regular expression begins and ends. You may ask, “But Jason, isn’t that what the quotes are for?” Let me assure you that there is more to it, as I will explain shortly.
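Here is one way the check might be wrapped for reuse, along with a look at the delimiters. The function name looksLikeEmail() and the sample addresses are assumptions for this sketch. Because preg_match() returns false rather than 0 when something goes wrong with the pattern, the comparison against 1 is strict.

```php
<?php
// Hypothetical helper so the check can be reused outside a form handler.
function looksLikeEmail(string $candidate): bool
{
    // 1 means a match, 0 means no match, false means an error occurred.
    return preg_match(
        '/^[A-Za-z0-9-_.+%]+@[A-Za-z0-9-.]+\.[A-Za-z]{2,4}$/',
        $candidate
    ) === 1;
}

// The delimiter does not have to be a slash; here # marks the start and end.
$sameCheck = preg_match(
    '#^[A-Za-z0-9-_.+%]+@[A-Za-z0-9-.]+\.[A-Za-z]{2,4}$#',
    'someone@example.com'
);

var_dump(looksLikeEmail('someone@example.com'), looksLikeEmail('someone@example'), $sameCheck);
```

Picking a delimiter that does not occur in the pattern saves escaping it inside the expression.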


To find an email address and add formatting, we would use preg_replace():

$formattedBlock = preg_replace(
    "/([A-Za-z0-9-_.+%]+@[A-Za-z0-9-.]+\.[A-Za-z]{2,4})/U",
    "<b>\\1</b>", $blockOText);

Here’s that explanation that was promised: we’ve placed a U after the ending delimiter as a flag that modifies how the regex matches. We’ve seen how regex matches are greedy, gobbling up as many characters as they can and only backtracking if they have to. U makes the regex “un-greedy.” Without it, a pattern can swallow more text than intended and match it all as one. By making it un-greedy, we tell it to find the shortest matching pattern.
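The effect of U is easiest to see on a pattern with an obvious long and short match. The sketch below uses a made-up HTML fragment and a simplified, lowercase-only address pattern; both are assumptions for illustration. The last call shows a side effect worth checking in your own patterns: U flips every quantifier, including {2,4}, so without an anchor after it the engine is content with the minimum of two letters.

```php
<?php
$html = '<b>one</b> and <b>two</b>'; // hypothetical input

// Greedy (the default): .+ runs on to the last closing tag.
preg_match('/<b>.+<\/b>/', $html, $greedy);

// Un-greedy with U: .+ stops at the first closing tag.
preg_match('/<b>.+<\/b>/U', $html, $ungreedy);

// U also applies to {2,4}: unanchored, two letters are enough to finish the match.
preg_match('/[a-z]+@[a-z]+\.[a-z]{2,4}/U', 'someone@example.com', $shortTld);

var_dump($greedy[0], $ungreedy[0], $shortTld[0]);
```

If the truncated ending matters, one option is to follow the quantifier with something that must match, such as the $ anchor used earlier.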

Did you notice we also wrapped the whole expression in parentheses? This causes the regex engine to capture a copy of the text that matches the expression between the parentheses, which we can reference with a back-reference (\1). The second argument to preg_replace() is telling the function to replace the text with an opening bold tag, whatever matched the pattern between the first set of parentheses, and a closing bold tag. If there were other sets of parentheses, they could be referenced with \2, \3, etc., depending on their position.
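With more than one set of parentheses, the numbering follows the order of the opening parentheses. This sketch uses a simplified, lowercase-only pattern and a made-up sentence (both assumptions) to reorder the two captured parts, and shows the two safe ways to quote a back-reference in PHP.

```php
<?php
$text = 'Write to info@example.com or sales@example.org today.'; // hypothetical

// Two sets of parentheses: \1 is the part before the at-sign, \2 the part after.
$pattern = '/([a-z]+)@([a-z]+\.[a-z]+)/';

// Single quotes keep the backslashes intact for preg_replace().
$rewritten = preg_replace($pattern, '\2 (user: \1)', $text);

// In double quotes the backslash must be doubled; "\1" alone is an octal escape.
$sameResult = preg_replace($pattern, "\\2 (user: \\1)", $text);

echo $rewritten, "\n";
```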


To scan some text and extract an array of all the email addresses found in it, preg_match_all() is our best choice:

$matchesFound = preg_match_all(
    "/[a-z0-9-_.+%]+@[a-z0-9-.]+\.[a-z]{2,4}/Ui",
    $articleWithEmailAddys, $listOfEmails);
if ($matchesFound) {
    foreach ($listOfEmails[0] as $foundEmail) {
        echo $foundEmail . "<br>";
    }
}

preg_match_all() returns how many matches it found, and sticks those matches into the variable reference we supplied as the third argument. It actually creates a multi-dimensional array in which the matches we’re looking for are found at index 0.
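The shape of that multi-dimensional array becomes clearer once the pattern contains parentheses. The sketch below uses a simplified, lowercase-only pattern and a made-up sentence, both of which are assumptions, and relies on the default ordering, PREG_PATTERN_ORDER.

```php
<?php
$text = 'Contact info@example.com or sales@example.org.'; // hypothetical

$count = preg_match_all('/([a-z]+)@([a-z]+\.[a-z]+)/', $text, $matches);

// With the default PREG_PATTERN_ORDER:
//   $matches[0] holds every full match,
//   $matches[1] holds what the first set of parentheses captured in each match,
//   $matches[2] holds what the second set captured.
print_r($matches);
```

Index 0 is therefore always the list of complete matches, whether or not the pattern has any parentheses.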

In addition to the U modifier, we provided i, which tells the regex engine that we want the pattern to be applied in a case-insensitive manner. That is, /a/i would match both a lower-case A and an upper-case A (or /A/i would work equally well for that matter, since the modifier is asking the engine to be case-agnostic). This allows us to write things like [a-z0-9] in our expression now instead of [A-Za-z0-9], which makes it a little shorter and easier to grok.
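A short comparison shows the modifier at work. The pattern here is a cut-down set of lowercase letters and a dot, chosen for this sketch, applied to one of the mixed-case samples from earlier.

```php
<?php
$subject = 'Fiery.Rebel'; // mixed-case sample

// Without i, the uppercase F and R fall outside the lowercase range.
$caseSensitive = preg_match('/^[a-z.]+$/', $subject);

// With i, the same range covers both cases.
$caseInsensitive = preg_match('/^[a-z.]+$/i', $subject);

// Modifiers can be combined after the closing delimiter.
$combined = preg_match('/^[a-z.]+$/Ui', $subject);

var_dump($caseSensitive, $caseInsensitive, $combined);
```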

Well that about wraps things up. While there is a lot more you can do using regular expressions involving look-ahead, look-behind, and more intricate examples of back-references, all of which you can find in PHP’s online documentation, hopefully you have plenty to work with that will serve you for many scripts just from this article.
