phpmaster | Practicing Regular Expressions with Search and Replace

Key Takeaways

Regular expressions (regex) can be practiced using search and replace utilities in most text editors and IDEs, allowing developers to gain more experience with regex outside of their code.
Word boundaries in regex, denoted by the special sequence \b, allow for whole word only searches, making it easier to replace specific words without affecting substrings within other words.
Groupings and back references in regex can be used to store individual matches for later use. These groups can be accessed by number and can be nested within each other.
Multiple groupings can be used in regex to store multiple matches and apply different replacements to each group. This can be used to create more complex replacements, such as wrapping tags around certain elements or adding attributes to tags.

If you’re just starting out with regular expressions (regex), the syntax can seem a bit puzzling at first (I would recommend Jason Pasnikowski’s article as a good starting point). One of the things that makes regex difficult to grasp in the beginning is how rarely you get a chance to use them in your code, which in turn limits the amount of practice you get. Professionals in any capacity, be it sports, entertainment, or development, always practice; some practice more than others.

So how can you practice using regex if you are limited to just using them in your code? The answer is to use a utility, of which there are many, that uses regex for performing search and replace. I’m sure everyone is familiar with the standard “find x and replace it with y” type of search and replace. Most IDEs and text editors have built-in regex engines to handle search and replace. In this article I’d like to walk through a series of exercises to help you practice using regex.

I’ll be using NetBeans for this article. Some editors might have slightly different regex behavior than what you see here, so if you’re using something other than NetBeans and it doesn’t work quite as you’d expect, be sure to read the documentation for your specific editor.

Word Boundaries

Let’s use the following code to start with for our examples; I’ve crafted it specifically to illustrate particular caveats of search and replace as you progress.

<div id="navigation">
 <a href="divebomb.php" title="All About Divebombs">Divebombs</a>&nbsp;&nbsp;|&nbsp;&nbsp;
 <a href="endives.php" title="All About Endives">Endives</a>&nbsp;&nbsp;|&nbsp;&nbsp;
 <a href="indivisible.php" title="Indivisible by Zero">Indivisible Numbers</a>&nbsp;&nbsp;|&nbsp;&nbsp;
 <a href="division.php" title="All About Division">Divison</a>&nbsp;&nbsp;|&nbsp;&nbsp;
 <a href="skydiving.php" title="All About Skydiving">Skydiving</a>&nbsp;&nbsp;|&nbsp;&nbsp;
</div>

This navigation code should ideally be an unordered list, not free anchors inside div tags. You can’t just replace the word “div” with “ul”, however, because divebomb would become ulebomb, endives would become enules, etc. You also can’t use “<div” because it would miss the closing div tag. You can manually replace the div tags with ul tags, or you can use the special sequence \b, which denotes a word boundary.

In the Search field, type: \bdiv\b
In the Replace field, type: ul

This only replaces the text “div” where it is delimited by word boundaries. Word boundaries allow you to perform whole word only searches, so the word “div” in <div id="navigation"> and </div> both get matched while the substrings in the anchors are left alone.
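The same replacement can be tried outside an editor. What follows is a minimal sketch in Python (a choice made here for illustration; the article itself works only in the NetBeans search dialog), using the standard re module and a shortened copy of the navigation markup:

```python
import re

# Shortened copy of the navigation markup from the article.
html = (
    '<div id="navigation">\n'
    ' <a href="divebomb.php" title="All About Divebombs">Divebombs</a>\n'
    ' <a href="endives.php" title="All About Endives">Endives</a>\n'
    '</div>'
)

# \b matches only at a word boundary, so the "div" inside "divebomb"
# and "endives" is left alone; only the standalone tag names change.
result = re.sub(r"\bdiv\b", "ul", html)
print(result)
```

Dropping the two \b sequences from the pattern would also rewrite the file names in the href attributes, which is exactly the problem described above.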

Later you’ll also see \w, which matches a single “word” character (a letter, digit, or underscore).

Groupings and Back References

Let’s continue refactoring the list, starting from the modified code from the first example. Right now your code should look like this:

<ul id="navigation">
 <a href="divebomb.php" title="All About Divebombs">Divebombs</a>&nbsp;&nbsp;|&nbsp;&nbsp;
 <a href="endives.php" title="All About Endives">Endives</a>&nbsp;&nbsp;|&nbsp;&nbsp;
 <a href="indivisible.php" title="Indivisible by Zero">Indivisible Numbers</a>&nbsp;&nbsp;|&nbsp;&nbsp;
 <a href="division.php" title="All About Division">Divison</a>&nbsp;&nbsp;|&nbsp;&nbsp;
 <a href="skydiving.php" title="All About Skydiving">Skydiving</a>&nbsp;&nbsp;|&nbsp;&nbsp;
</ul>

You can easily do a standard search and replace on the anchor tags without any of the issues that prevented you from doing so with div, but where is the fun in that? In the spirit of practice, let’s use regex to wrap the anchors in li tags.

To select the anchors, type the following in the Search field: (<a.*>)
In the Replace field, type: <li>$1</li>

Ignoring the parentheses in the search pattern for now, let’s break up the pattern and discuss each piece of it. The first piece is <a, which tells the regex engine to match a less-than symbol followed by the letter a. The next piece is .*>, which tells the engine to match any character zero or more times followed by a greater-than symbol. Because .* matches as much as it can, the match runs to the last greater-than symbol on the line, so the pattern matches each complete anchor, from <a through the closing </a>, in the block of code above.

The parentheses in the search pattern perform a special function: they group the individual matches so you can access them later. By adding the parentheses, you are telling the regex engine to store the matching result because you’ll need it later. You can access these groups by number.

The replace pattern tells the engine to replace the search pattern with an opening li tag, followed by the contents of the first grouping, and a closing li tag. In this example there is only one group (because there is only one set of parentheses), so the $1 in the middle of the li tags indicates this is the group you want to use. (Some editors may use \1 instead of $1. If $1 doesn’t work, then undo your replacement and try the other variant.)
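As a rough illustration of the same grouping and back reference, here is a sketch in Python (an assumption for demonstration purposes, not the editor the article uses). Python’s re module is one of the engines that writes the back reference as \1 rather than $1:

```python
import re

# One line of the navigation markup from the article.
line = ('<a href="endives.php" title="All About Endives">Endives</a>'
        '&nbsp;&nbsp;|&nbsp;&nbsp;')

# The parentheses capture the whole anchor; \1 in the replacement
# puts the captured text back between the li tags.
wrapped = re.sub(r"(<a.*>)", r"<li>\1</li>", line)
print(wrapped)
```

The trailing entities and pipe stay outside the li tags because the greedy .* stops at the last greater-than symbol on the line, which belongs to the closing </a>.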

You can have multiple groups, and groups can be nested, which you’ll see in just a moment. You’re going to modify the patterns you just used to add the li tags in order to create a more robust navigation. Undo the replacements you’ve just made. Usually something like Ctrl+Z works just fine, but if it doesn’t, here are the search and replace patterns to revert the code:

In the Search field, type: <li>(<a.*>)</li>
In the Replace field, type: $1

Multiple Groupings

Alright, now let’s wrap the anchor tags in li tags complete with class and id attributes for use with CSS. To accomplish this, you’ll use the following:

Search: (<a.*>(\w+).*</a>)
Replace: <li class="navEntry" id="$2">$1</li>

As in the second example’s search pattern, <a.*> matches the opening anchor tag. You’re asking the regex engine to find a string that begins with a less-than symbol, followed by the letter a, followed by a series of zero or more characters that ends with a greater-than symbol. With \w+ you are also asking the engine to look for a sequence of one or more word characters (letters, digits, or underscores), which excludes whitespace and symbol characters. The parentheses around \w+ indicate you want to store the match as a group. Next you added .* to the pattern to match any other characters that may appear before the closing anchor tag. The result is that $1 will have the matched anchor string, and $2 will have the first word of the link’s text.

Breaking down the replacement, you begin with li, a class attribute and its value, followed by the id attribute. Instead of providing an id value, however, you have $2. This tells the regex engine you want to use the content stored in the second group from the search pattern, which in this case is the \w+. Then you close the opening li tag, tell the regex engine you want to use the first grouping ($1 is the entire anchor tag), and finally close the li element.
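To see both groups at work on a link whose text has more than one word, here is a small sketch in Python (chosen here only for illustration; it writes back references as \1 and \2 rather than $1 and $2):

```python
import re

# One line of the navigation markup, with two words of link text.
line = ('<a href="indivisible.php" title="Indivisible by Zero">'
        'Indivisible Numbers</a>')

# Group 1 is the whole anchor; group 2 (nested inside it) is the
# first word of the link text.
pattern = r"(<a.*>(\w+).*</a>)"
replacement = r'<li class="navEntry" id="\2">\1</li>'

item = re.sub(pattern, replacement, line)
print(item)
```

Only “Indivisible” ends up in the id attribute because \w+ stops at the space, while the full anchor, including “Numbers”, is carried over by the first group.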

Be careful when you are determining which groups to replace. Consider the following hypothetical example (I’ve used group names instead of patterns to illustrate how grouping works):

(group1(group2))(group3)

Using the above gives you the following results:

$1 = group1group2
$2 = group2
$3 = group3

$1 contains both group1 and group2 because parentheses enclose both of them. This is true even though group2 is a group by itself. And then of course group3 is a group to itself.
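The numbering can be checked directly. In this sketch (Python is an assumption made for illustration), the group names from the hypothetical pattern are used as literal text, and groups are numbered by the position of their opening parenthesis, counting from the left:

```python
import re

# The group names double as the literal text to match.
m = re.match(r"(group1(group2))(group3)", "group1group2group3")

# groups() returns the captures in order: $1, $2, $3.
groups = m.groups()
print(groups)  # ('group1group2', 'group2', 'group3')
```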

To finish cleaning things up, you can remove the non-breaking space entities and the pipe character from the end of the lines and replace them with an empty string (the pipe needs to be preceded by a backslash in the expression because it has special meaning to the engine).

Search: &nbsp;&nbsp;\|&nbsp;&nbsp;
Leave the Replace field empty.
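The effect of the backslash is easy to see by running the pattern with and without it. This is a sketch in Python (an assumption for illustration; your editor’s engine should treat the pipe the same way):

```python
import re

line = ('<li class="navEntry" id="Endives">'
        '<a href="endives.php" title="All About Endives">Endives</a></li>'
        '&nbsp;&nbsp;|&nbsp;&nbsp;')

# Escaped pipe: the entities and the literal "|" are matched as one string.
cleaned = re.sub(r"&nbsp;&nbsp;\|&nbsp;&nbsp;", "", line)

# Unescaped pipe: it means "or", so each pair of entities is matched
# separately and the "|" character itself is left behind.
leftover = re.sub(r"&nbsp;&nbsp;|&nbsp;&nbsp;", "", line)

print(cleaned)
print(leftover)
```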

Your code should now look like this, a nice, neat, well-structured list you can style with CSS:

<ul id="navigation">
 <li class="navEntry" id="Divebombs"><a href="divebomb.php" title="All About Divebombs">Divebombs</a></li>
 <li class="navEntry" id="Endives"><a href="endives.php" title="All About Endives">Endives</a></li>
 <li class="navEntry" id="Indivisible"><a href="indivisible.php" title="Indivisible by Zero">Indivisible Numbers</a></li>
 <li class="navEntry" id="Divison"><a href="division.php" title="All About Division">Divison</a></li>
 <li class="navEntry" id="Skydiving"><a href="skydiving.php" title="All About Skydiving">Skydiving</a></li>
</ul>

Summary

Thanks for taking some time to learn a little bit more about regular expressions and practicing with them using search and replace. I encourage anyone who is struggling to grasp the concepts to practice using search and replace in their editor because it’s convenient and generally provides immediate visual feedback. If necessary, you can copy and paste the content you’re working with into a blank file and experiment with it, running replacements and undoing them, until you get what you like.

Frequently Asked Questions (FAQs) about Regular Expressions

What are the basic components of a regular expression?

Regular expressions, often abbreviated as regex, are sequences of characters that define a search pattern. The basic components of a regular expression include literals, metacharacters, and quantifiers. Literals are the actual characters you want to match, such as ‘a’, ‘1’, or ‘#’. Metacharacters are special characters that have unique meanings in regex, like ‘.’ (matches any character except newline), ‘*’ (matches zero or more of the preceding element), and ‘^’ (matches the start of the line). Quantifiers specify how many instances of a character, group, or character class must be present in the input for a match to be found.

How can I test my regular expressions?

There are several online tools available to test your regular expressions. Websites like regexr.com and regexone.com allow you to input your regex and test it against a string of text. These tools highlight matches and provide explanations for your regex, which can be very helpful for beginners. Additionally, most programming languages have built-in functions to test regular expressions.

What are some common uses of regular expressions?

Regular expressions are used in programming and text processing to find and manipulate text based on specific patterns. They are commonly used for tasks like data validation (e.g., checking if a user’s input is a valid email address), search and replace operations, parsing and splitting strings, and syntax highlighting.

How can I match a specific number of occurrences with regex?

To match a specific number of occurrences with regex, you can use curly braces {}. For example, the regex ‘a{3}’ will match exactly three ‘a’ characters. You can also specify a range, like ‘a{2,4}’ which will match two to four ‘a’ characters.
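A quick way to check these counts is a short script. The following sketch uses Python’s re module (an assumption for illustration; the brace syntax is the same in most engines):

```python
import re

# Exactly three: "aaa" matches in full, "aa" does not.
exact_ok = re.fullmatch(r"a{3}", "aaa") is not None
exact_short = re.fullmatch(r"a{3}", "aa") is not None

# Two to four: a single "a" is skipped, and a run of five yields
# a match of four with one "a" left over.
ranged = re.findall(r"a{2,4}", "a aa aaaaa")

print(exact_ok, exact_short, ranged)
```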

What is the difference between greedy and lazy quantifiers in regex?

Greedy quantifiers match as many instances of a pattern as possible, while lazy quantifiers match as few as possible. For example, in the string ‘aaaaa’, the regex ‘a*’ (a greedy quantifier) will match all five ‘a’ characters, while ‘a*?’ (a lazy quantifier) will match as few as it is allowed to, which for ‘*’ is zero characters. The lazy ‘a+?’ requires at least one character, so it will match just the first ‘a’.
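The difference shows up clearly when the same string is matched three ways. This is a minimal sketch in Python (an assumption for illustration):

```python
import re

text = "aaaaa"

greedy = re.match(r"a*", text).group()      # takes as much as it can
lazy_star = re.match(r"a*?", text).group()  # takes as little as it can: nothing
lazy_plus = re.match(r"a+?", text).group()  # needs at least one character

print(repr(greedy), repr(lazy_star), repr(lazy_plus))
```

Note that a lazy ‘*’ can succeed on an empty string, so ‘+?’ is the form to use when at least one character is required.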

How can I match any character except a specific one in regex?

To match any character except a specific one, you can use a negated character class. This is denoted by a caret ‘^’ inside square brackets ‘[]’. For example, ‘[^a]’ will match any character except ‘a’.
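For instance, this short Python sketch (an assumption for illustration) collects every character of a word that is not ‘a’:

```python
import re

# [^a] matches one character at a time, skipping every "a".
not_a = re.findall(r"[^a]", "banana")
print(not_a)  # ['b', 'n', 'n']
```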

What are groups and how are they used in regex?

Groups in regex are portions of the pattern enclosed in parentheses ‘()’. They allow you to apply quantifiers to multiple characters, capture the text matched for later use, or create a condition around a pattern. For example, ‘(ab)*’ will match zero or more occurrences of ‘ab’.

How can I match the start and end of a line in regex?

The caret ‘^’ matches the start of a line, and the dollar sign ‘$’ matches the end of a line. For example, ‘^a’ will match any line that starts with ‘a’, and ‘a$’ will match any line that ends with ‘a’.
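Whether the anchors apply to each line or only to the whole text depends on the engine and its settings. In this Python sketch (an assumption for illustration), the re.MULTILINE flag is what makes them work line by line:

```python
import re

text = "apple\nbanana\navocado"

# With re.MULTILINE, ^ and $ match at the start and end of every line.
starts_with_a = re.findall(r"^a.*", text, flags=re.MULTILINE)
ends_with_a = re.findall(r".*a$", text, flags=re.MULTILINE)

print(starts_with_a)  # ['apple', 'avocado']
print(ends_with_a)    # ['banana']
```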

How can I match a word boundary in regex?

The ‘\b’ metacharacter matches a word boundary. This is the position where a word character is not followed or preceded by another word-character, such as between a letter and a space.

How can I escape special characters in regex?

To escape special characters in regex, you use the backslash ‘\’. For example, to match the literal character ‘.’, you would write ‘\.’ in your regex.
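The following Python sketch (an assumption for illustration) compares an unescaped and an escaped dot, and shows re.escape, a helper in Python’s standard library that adds the backslashes for you:

```python
import re

text = "a.b"

any_char = re.findall(r".", text)      # unescaped: matches every character
literal_dot = re.findall(r"\.", text)  # escaped: matches only the dot

# re.escape builds a pattern that matches the given text literally.
escaped = re.escape(text)

print(any_char, literal_dot, escaped)
```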