phpmaster | Using PHP Regular Expressions

^[A-Za-z0-9-_.+%]+@[A-Za-z0-9-.]+\.[A-Za-z]{2,4}$

It makes all the sense of ancient Egyptian hieroglyphics to you, although those little pictures at least look like they have meaning. But this… this looks like gibberish. What does it mean? It means oleomarg23@hotmail.com, Fiery.Rebel@veneuser.info, robustlamp+selfmag@gmail.ca, or nearly any other simple email address, because this is a pattern written in a language that describes how to match text in strings. When you’re looking to go beyond straight text matches, like finding “stud” in “Mustard” (which would fail, btw), and you need a way to “explain” what you’re looking for because each instance may be different, you’ve come to need regular expressions, affectionately called regex.

Intro to Regex Notation

To get your feet wet, let’s take the above example and break it down piece by piece.
The pattern opens with ^. The beginning of a line can be detected much like a carriage return, even though it isn’t really a character, invisible or otherwise. Using ^ tells the regex engine that the match must start at the beginning of the line.
Next comes A-Za-z. Instead of specifying each and every letter of the alphabet, we have a shorthand that gives a range. Matching is normally case sensitive, so you have to specify both an uppercase and a lowercase range.
The same goes for numbers: 0-9 shortens them to a range instead of writing out all 10 digits.
The rest of the set, -_.+%, lists the special characters we’re allowing: a dash, underscore, dot, plus-sign, and percent-sign.
The brackets [ ] surrounding our ranges effectively take everything you’ve put between them to create your own custom wildcard. Our “wildcard” is capable of matching any letter A-Z in either uppercase or lowercase, a digit 0-9, or one of our special punctuation characters.
The + right after the closing bracket is a quantifier; it modifies how many times the previous item should match, and in this case the previous item is the set within brackets. + means “at least one,” so in our example, after matching the beginning of the string, we have to have at least one of the characters within the brackets. At this point we can match (given the sample email addresses from earlier) oleomarg23, Fiery.Rebel, and robustlamp+selfmag. Something like @SodaCanDrive.com would fail because we must have at least one of the characters in the bracketed set at the very beginning of the text. In addition to + as a quantifier, there is *, which is almost identical except that it is also satisfied when the previous item doesn’t appear at all. If we replaced the first + quantifier in the sample with * and had this:

^[A-Za-z0-9-_.+%]*@[A-Za-z0-9-.]+\.[A-Za-z]{2,4}$

it would have successfully matched the string @SodaCanDrive.com, as we are effectively telling the regex engine to keep matching until it comes across a character not in the set, even if there aren’t any.
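To see the difference between + and * in action, here is a small sketch using preg_match(), which is covered later in this article. The variable names are only for illustration, and the dot before the top-level domain is written as \. so that it matches literally.

```php
<?php
// The same pattern twice: once with + (one or more) and once with * (zero or more)
// applied to the first bracketed set.
$withPlus = '/^[A-Za-z0-9-_.+%]+@[A-Za-z0-9-.]+\.[A-Za-z]{2,4}$/';
$withStar = '/^[A-Za-z0-9-_.+%]*@[A-Za-z0-9-.]+\.[A-Za-z]{2,4}$/';

$noLocalPart = '@SodaCanDrive.com';

// + demands at least one character before the @, so this gives 0.
$plusResult = preg_match($withPlus, $noLocalPart);

// * is satisfied by zero characters before the @, so this gives 1.
$starResult = preg_match($withStar, $noLocalPart);

echo "plus: $plusResult, star: $starResult\n";
```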
In the original pattern, the @ comes next. It matches literally, so now we’ve matched oleomarg23@, Fiery.Rebel@, and robustlamp+selfmag@. The text greencandelabra.com fails because it doesn’t have an at-sign!
After the at-sign comes [A-Za-z0-9-.]+. This portion of the expression is similar to what we matched before the at-sign, except this time we’re not allowing the underscore, plus-sign, or percent-sign. Now we’re up to oleomarg23@hotmail.com, Fiery.Rebel@veneuser.info and robustlamp+selfmag@gmail.ca. gnargly3.1415@pie_a_la_mode.com would only match up to gnargly3.1415@pie.
Next is the escaped dot, written \. so as to match it literally. Note the plus-sign matched literally when it was inside brackets, but outside it had special meaning as a quantifier. The dot works the same way: outside the brackets it has to be escaped or it will be treated as the wildcard character; inside the brackets, a dot means a dot. Uh oh! Since we already matched the .com, .info and .ca, it would seem like the match would fail because we don’t have any more dots. But regex is smart: the matching engine backtracks, giving up characters it already matched, to find the match. So now we’re back to oleomarg23@hotmail., Fiery.Rebel@veneuser. and robustlamp+selfmag@gmail. (each ending in a dot). At this point, gnargly3.1415@pie_a_la_mode.com fails because the character after what’s matched so far is not a dot. drnddog@chewwed.legs.onchair continues as drnddog@chewwed.legs. (again with the trailing dot).
The last piece before the end is [A-Za-z]{2,4}. Remember how we made our own custom wildcard using brackets? We can do a similar thing with braces to make custom quantifiers. {2,4} means to match at least two times but no more than four times. If we only wanted to match exactly two times, we would write {2}. We could handle any quantity up to a maximum of four with {0,4}. {2,} would match a minimum of two. Here, {2,4} limits the last bracketed set to any 2, 3, or 4 letters. We’ve nearly fully matched oleomarg23@hotmail.com, Fiery.Rebel@veneuser.info and robustlamp+selfmag@gmail.ca. drnddog@chewwed.legs.onchair only gets as far as drnddog@chewwed.legs.onch, since four letters is the limit. We just have one more to go…
$ is the counterpart to ^. Just as ^ does for the start of the line, $ anchors the match to the end of the line. Our examples all match now, and drnddog@chewwed.legs.onchair fails because the string doesn’t end with a dot followed by 2, 3, or 4 letters.
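The walkthrough above can be checked with a short sketch that runs the full pattern against every sample address mentioned so far. It uses preg_match(), introduced in the next section; the variable names are assumptions made for this example.

```php
<?php
// The full pattern from the walkthrough, with the dot before the TLD escaped.
$emailPattern = '/^[A-Za-z0-9-_.+%]+@[A-Za-z0-9-.]+\.[A-Za-z]{2,4}$/';

$samples = [
    'oleomarg23@hotmail.com',
    'Fiery.Rebel@veneuser.info',
    'robustlamp+selfmag@gmail.ca',
    '@SodaCanDrive.com',               // nothing before the @
    'greencandelabra.com',             // no @ at all
    'gnargly3.1415@pie_a_la_mode.com', // underscores not allowed after the @
    'drnddog@chewwed.legs.onchair',    // more than four letters after the last dot
];

$results = [];
foreach ($samples as $sample) {
    // preg_match() returns 1 on a match and 0 otherwise.
    $results[$sample] = preg_match($emailPattern, $sample) === 1;
    echo str_pad($sample, 34) . ($results[$sample] ? 'matches' : 'fails') . "\n";
}
```

Only the first three addresses should be reported as matches; the remaining four fail for the reasons given in the comments.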

Regexes in PHP

It’s all well and good to have this basic understanding of the notation used by regular expressions, but we still need to know how to apply it in the context of PHP to actually do something productive, so let’s look at the functions preg_match(), preg_replace() and preg_match_all().

preg_match()

To validate a form field for an email address, we’d use preg_match():

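<?php
// Fall back to an empty string if the form field was not submitted.
$emailAddy = $_POST["emailAddy"] ?? "";
if (preg_match('/^[A-Za-z0-9-_.+%]+@[A-Za-z0-9-.]+\.[A-Za-z]{2,4}$/', $emailAddy)) {
    echo "Email address accepted";
}
else {
    echo "Email address is all broke.";
}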

If a match is found, preg_match() returns 1, otherwise 0 (or false if something went wrong, such as a malformed pattern). Notice that we added slashes to the beginning and end of the regex. These are used as delimiters to show the function where the regular expression begins and ends. You may ask, “But Jason, isn’t that what the quotes are for?” Let me assure you that there is more to it, as I will explain shortly.
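Because preg_match() can hand back 1, 0, or false, a strict comparison keeps the three cases apart. The following is a minimal sketch; looksLikeEmail() is a hypothetical helper name invented for this example, not a PHP function.

```php
<?php
// Hypothetical helper, not part of PHP: wraps the article's email pattern.
function looksLikeEmail(string $candidate): bool
{
    $result = preg_match(
        '/^[A-Za-z0-9-_.+%]+@[A-Za-z0-9-.]+\.[A-Za-z]{2,4}$/',
        $candidate
    );

    // preg_match() gives 1 (match), 0 (no match), or false (error);
    // comparing strictly against 1 treats an error as "not valid".
    return $result === 1;
}

var_dump(looksLikeEmail('oleomarg23@hotmail.com'));
var_dump(looksLikeEmail('Email address is all broke.'));
```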

preg_replace()

To find an email address and add formatting, we would use preg_replace():

<?php
// No U flag here: un-greedy matching would cut the top-level domain short.
// Single quotes keep \1 as a back-reference; in double quotes "\1" would be
// read as an octal character escape.
$formattedBlock = preg_replace(
    '/([A-Za-z0-9-_.+%]+@[A-Za-z0-9-.]+\.[A-Za-z]{2,4})/',
    '<b>\1</b>', $blockOText);

Here’s that explanation that was promised: the delimiters mark where the pattern ends so that flags, called pattern modifiers, can follow the closing delimiter and change how the regex matches. One such flag is U. We’ve seen how regex matches are greedy, gobbling up as many characters as they can and only backtracking if they have to. U makes the quantifiers “un-greedy,” so each one matches as few characters as possible. Given the text <b>one</b> and <b>two</b>, the pattern /<b>.+<\/b>/ matches the whole string as one, while /<b>.+<\/b>/U finds the shortest match… just <b>one</b>. We left U off the email pattern on purpose: un-greedy matching would let {2,4} stop after two letters and turn tweedle@dee.com into tweedle@dee.co. Did you notice we also wrapped the whole expression in parentheses? This causes the regex engine to capture a copy of the text that matches the expression between the parentheses, which we can reference with a back-reference (\1). The second argument to preg_replace() is telling the function to replace the text with an opening bold tag, whatever matched the pattern between the first set of parentheses, and a closing bold tag. If there were other sets of parentheses, they could be referenced with \2, \3, etc. depending on their position. (PHP also accepts the form $1, $2, and so on, which its manual prefers.)
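To illustrate more than one back-reference, here is a sketch that splits the email pattern into two capture groups. The sample text, the variable names, and the italic tag around the domain are assumptions made for this example.

```php
<?php
$blockOText = 'Write to tweedle@dee.com or tweedle@dum.com today.';

// Group 1 captures the part before the @, group 2 captures the domain.
$pattern = '/([A-Za-z0-9-_.+%]+)@([A-Za-z0-9-.]+\.[A-Za-z]{2,4})/';

// Single quotes keep \1 and \2 intact as back-references.
$formattedBlock = preg_replace($pattern, '<b>\1</b> at <i>\2</i>', $blockOText);

echo $formattedBlock . "\n";
```

Each address is rewritten in place, with the first group wrapped in bold tags and the second in italic tags.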

preg_match_all()

To scan some text and extract an array of all the email addresses found in it, preg_match_all() is our best choice:

<?php
$matchesFound = preg_match_all(
    '/([a-z0-9-_.+%]+@[a-z0-9-.]+\.[a-z]{2,4})/i',
    $articleWithEmailAddys, $listOfEmails);
if ($matchesFound) {
    foreach ($listOfEmails[0] as $foundEmail) {
        echo $foundEmail . "<br>";
    }
}

preg_match_all() returns how many matches it found, and sticks those matches into the variable we supplied by reference as the third argument. It actually creates a multi-dimensional array in which the full matches we’re looking for are found at index 0 (index 1 holds whatever the first set of parentheses captured, which here is the same text). The modifier this time is i, which instructs the regex engine that we want the pattern to be applied in a case-insensitive manner. (U is left off so that {2,4} can greedily take the whole top-level domain instead of stopping at two letters.) That is, /a/i would match both a lower-case A and an upper-case A (or /A/i would work equally well for that matter, since the modifier is asking the engine to be case-agnostic). This allows us to write things like [a-z0-9] in our expression now instead of [A-Za-z0-9], which makes it a little shorter and easier to grok.

Well, that about wraps things up. While there is a lot more you can do using regular expressions involving look-ahead, look-behind, and more intricate examples of back-references, all of which you can find in PHP’s online documentation, hopefully you have plenty to work with that will serve you for many scripts just from this article.

Frequently Asked Questions (FAQs) about Regular Expressions

What are the basic components of a regular expression?

Regular expressions, often abbreviated as regex, are sequences of characters that define a search pattern. They are primarily used for string pattern matching and manipulation. The basic components of a regular expression include literals, metacharacters, and quantifiers. Literals are standard characters that match themselves exactly. Metacharacters are special characters that have unique meanings, such as the dot (.) that matches any character except a newline. Quantifiers specify how many instances of a character, group, or character class must be present in the input for a match to be found.

How do I use regular expressions in PHP?

In PHP, regular expressions are most commonly used with the preg_match(), preg_match_all(), and preg_replace() functions. The preg_match() function searches a string for a pattern, returning 1 if the pattern matches and 0 if it does not (or false if an error occurs). The preg_match_all() function is similar, but finds every match in the string and returns how many it found. The preg_replace() function searches a string for a pattern and replaces each match with the specified text.
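A compact sketch of the three functions side by side may help; the sample text and variable names are assumptions chosen for illustration.

```php
<?php
$text = 'cat, cot, and cut';

// preg_match(): 1 if the pattern matches, 0 if not; $first holds the first match.
$found = preg_match('/c.t/', $text, $first);

// preg_match_all(): the number of matches; $all[0] lists every full match.
$count = preg_match_all('/c.t/', $text, $all);

// preg_replace(): returns the string with each match replaced.
$replaced = preg_replace('/c.t/', 'dog', $text);

echo "$found / $count / $replaced\n";
```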

What is the role of metacharacters in regular expressions?

Metacharacters are the building blocks of regular expressions. They have special meanings when used in a pattern. For example, the dot (.) metacharacter matches any character except a newline, the asterisk (*) matches zero or more occurrences of the preceding element, and the plus (+) matches one or more occurrences of the preceding element. Understanding metacharacters is crucial to mastering regular expressions.

How can I match any single character in a regular expression?

The dot (.) metacharacter is used to match any single character in a regular expression, except for a newline. For example, the pattern “c.t” would match “cat”, “cut”, “cot”, and so on. If you want to match any character including a newline, you can use the dot-all mode, which is activated by adding the ‘s’ flag after the pattern.
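As a small sketch of the newline behavior, the following compares the default mode with the s modifier. The two-line test string is an assumption for this example.

```php
<?php
// A "c" and a "t" separated by a newline.
$twoLines = "c\nt";

// By default the dot does not match a newline, so this gives 0.
$defaultMode = preg_match('/c.t/', $twoLines);

// With the s modifier (dot-all), the dot matches a newline too, so this gives 1.
$dotAllMode = preg_match('/c.t/s', $twoLines);

echo "default: $defaultMode, dot-all: $dotAllMode\n";
```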

What is the difference between greedy and lazy quantifiers in regular expressions?

Greedy and lazy quantifiers in regular expressions determine how much text a repeated pattern consumes. A greedy quantifier will match as many instances of a pattern as possible, while a lazy quantifier will match as few as possible. For example, in the pattern “a.*b”, the “.*” is a greedy quantifier and will match as many characters as possible between “a” and “b”. If you want to make it lazy, you can use “.*?”, which will match as few characters as possible.
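The difference is easiest to see on a string with more than one possible ending. This sketch uses the patterns a.*b and a.*?b; the subject string is an assumption picked for the example.

```php
<?php
$subject = 'aXbXb';

// Greedy: .* grabs as much as possible, so the match runs to the last b.
preg_match('/a.*b/', $subject, $greedy);

// Lazy: .*? grabs as little as possible, so the match stops at the first b.
preg_match('/a.*?b/', $subject, $lazy);

echo $greedy[0] . ' vs ' . $lazy[0] . "\n";
```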

How can I match the start and end of a string in a regular expression?

The caret (^) and dollar sign ($) metacharacters are used to match the start and end of a string, respectively. For example, “^a” would match any string that starts with “a”, and “a$” would match any string that ends with “a”.

How can I match a specific number of occurrences in a regular expression?

You can use curly braces ({}) to specify a specific number of occurrences in a regular expression. For example, “a{3}” would match exactly three “a” characters.
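The brace forms can be compared in a short sketch. The anchors ^ and $ are added here so that each test must account for the whole string; the variable names are illustrative only.

```php
<?php
// {3}: exactly three; {2,4}: two to four; {2,}: two or more.
$exactlyThree = preg_match('/^a{3}$/', 'aaa');       // three a's: matches
$tooMany      = preg_match('/^a{3}$/', 'aaaa');      // four a's: no match
$twoToFour    = preg_match('/^a{2,4}$/', 'aaaa');    // within 2..4: matches
$atLeastTwo   = preg_match('/^a{2,}$/', 'aaaaaaa');  // no upper limit: matches

echo "$exactlyThree $tooMany $twoToFour $atLeastTwo\n";
```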

What is the difference between character classes and character sets in regular expressions?

Character classes and character sets are similar in that they both match one character out of several possible characters. However, character classes are predefined sequences of characters, such as \d for digits, while character sets are defined by the user using square brackets ([]). For example, [abc] would match any single character that is either “a”, “b”, or “c”.

How can I use regular expressions to validate user input?

Regular expressions can be used to validate user input by matching the input against a specific pattern. For example, you can use the pattern “^[a-zA-Z0-9_]{1,}$” to validate a username that should only contain alphanumeric characters and underscores, and should be at least one character long.
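Here is a minimal sketch of that username check in PHP. The function name isValidUsername() is hypothetical, and the D modifier is an addition not mentioned above: without it, $ also matches just before a trailing newline, so a value like "jason\n" would slip through.

```php
<?php
// Hypothetical helper name, used here only for illustration.
function isValidUsername(string $username): bool
{
    // D makes $ match only at the very end of the string,
    // so a trailing newline is rejected.
    return preg_match('/^[a-zA-Z0-9_]{1,}$/D', $username) === 1;
}

var_dump(isValidUsername('jason_42'));    // letters, digits, underscore
var_dump(isValidUsername('not valid!'));  // space and ! are not allowed
var_dump(isValidUsername(''));            // at least one character is required
var_dump(isValidUsername("jason\n"));     // trailing newline
```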

How can I match a pattern across multiple lines in a regular expression?

You can use the multiline mode in regular expressions to match a pattern across multiple lines. This is activated by adding the ‘m’ flag after the pattern. In multiline mode, the caret (^) and dollar sign ($) metacharacters match the start and end of each line, rather than the start and end of the entire string.
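A short sketch shows the effect of the m modifier on ^. The three-line sample string and the variable names are assumptions for this example.

```php
<?php
$lines = "apple\nbanana\navocado";

// Without m, ^ only matches at the very start of the string.
$singleCount = preg_match_all('/^a[a-z]+/', $lines, $single);

// With m, ^ also matches right after each newline.
$multiCount = preg_match_all('/^a[a-z]+/m', $lines, $multi);

echo "$singleCount vs $multiCount\n";
```

Without the modifier only the first line is a candidate; with it, every line that starts with "a" is found.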