Practicing Regular Expressions with Search and Replace
If you’re just starting out with regular expressions (regex), the syntax can seem a bit puzzling at first (I would recommend Jason Pasnikowski’s article as a good starting point). One thing that makes regex difficult to grasp in the beginning is how rarely you get a chance to use it in your code, which in turn limits the amount of practice you get. Professionals in any field, be it sports, entertainment, or development, always practice – some more than others.
So how can you practice using regex if you are limited to just using them in your code? The answer is to use one of the many utilities that support regex for search and replace. I’m sure everyone is familiar with the standard “find x and replace it with y” type of search and replace. Most IDEs and text editors have built-in regex engines to handle search and replace. In this article I’d like to walk through a series of exercises to help you practice using regex.
I’ll be using NetBeans for this article. Some editors might have slightly different regex behavior than what you see here, so if you’re using something other than NetBeans and it doesn’t work quite as you’d expect, be sure to read the documentation for your specific editor.
Word Boundaries
Let’s use the following code to start with for our examples; I’ve crafted it specifically to illustrate particular caveats of search and replace as you progress.
<div id="navigation">
<a href="divebomb.php" title="All About Divebombs">Divebombs</a> |
<a href="endives.php" title="All About Endives">Endives</a> |
<a href="indivisible.php" title="Indivisible by Zero">Indivisible Numbers</a> |
<a href="division.php" title="All About Division">Divison</a> |
<a href="skydiving.php" title="All About Skydiving">Skydiving</a> |
</div>
This navigation code should ideally be an unordered list, not free anchors inside div tags. You can’t just replace the word “div” with “ul”, however, because divebomb would become ulebomb, endives would become enules, etc. You also can’t use “<div” because it would miss the closing div tag. You can manually replace the div tags with ul tags, or you can use the special sequence \b, which denotes a word boundary.
In the Search field, type: \bdiv\b
In the Replace field, type: ul
This only replaces the text “div” that was delimited by word boundaries. Word boundaries allow you to perform whole word only searches, so the word “div” in <div id="navigation"> and </div> both get matched while the substrings in the anchors are left alone.
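The same replacement can be tried outside an editor. Here is a minimal sketch in Python (my choice of language, not something the article uses), assuming its built-in re module treats \b the same way NetBeans does for this input. The markup is abbreviated to three links, and the variable names are only illustrative.

```python
import re

# The navigation markup from this article, abbreviated to three links.
html = '''<div id="navigation">
<a href="divebomb.php" title="All About Divebombs">Divebombs</a> |
<a href="endives.php" title="All About Endives">Endives</a> |
<a href="division.php" title="All About Division">Divison</a> |
</div>'''

# \b matches the position between a word character and a non-word
# character, so only the standalone word "div" is replaced.
with_boundaries = re.sub(r'\bdiv\b', 'ul', html)

# Without \b every "div" substring is replaced, which mangles the links.
without_boundaries = re.sub(r'div', 'ul', html)

print(with_boundaries)
```

Printing without_boundaries as well shows the damage described above, such as ulebomb.php and enules.php.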
Later you’ll also see \w, which matches a single “word” character: a letter, a digit, or an underscore.
Groupings and Back References
Working from the modified code from the first example, let’s continue refactoring the list. Right now your code should look like this:
<ul id="navigation">
<a href="divebomb.php" title="All About Divebombs">Divebombs</a> |
<a href="endives.php" title="All About Endives">Endives</a> |
<a href="indivisible.php" title="Indivisible by Zero">Indivisible Numbers</a> |
<a href="division.php" title="All About Division">Divison</a> |
<a href="skydiving.php" title="All About Skydiving">Skydiving</a> |
</ul>
You can easily do a standard search and replace on the anchor tags without any of the issues that prevented you from doing so with div, but where is the fun in that? In the spirit of practice, let’s use regex to wrap the anchors in li tags.
To select the anchors, type the following in the Search field: (<a.*>)
In the Replace field, type: <li>$1</li>
Ignoring the parentheses in the search pattern for now, let’s break up the pattern and discuss each piece of it. The first piece is <a, which tells the regex engine to match a less-than symbol followed by the letter a. The next part of this piece is .*>, which tells the engine to match any character zero or more times followed by a greater-than symbol. This piece matches the anchor tags in the block of code above.
The parentheses in the search pattern perform a special function: they group parts of the match so that you can access them later. By adding the parentheses, you are telling the regex engine to store the matched text because you’ll need it later. You can access these groups by number.
The replace pattern tells the engine to replace the search pattern with an opening li tag, followed by the contents of the first grouping, and a closing li tag. In this example there is only one group (because there is only one set of parentheses), so the $1 in the middle of the li tags indicates this is the group you want to use. (Some editors may use \1 instead of $1. If $1 doesn’t work, then undo your replacement and try the other variant.)
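To see the group at work outside the editor, here is a small Python sketch of the same search and replace. Python is my own choice here, and its re module is one of the engines that writes the back reference as \1 rather than $1. The markup is abbreviated to two links.

```python
import re

html = '''<ul id="navigation">
<a href="divebomb.php" title="All About Divebombs">Divebombs</a> |
<a href="endives.php" title="All About Endives">Endives</a> |
</ul>'''

# Group 1 captures everything from "<a" up to the last ">" on the line,
# because .* is greedy and does not cross line breaks by default.
# Python spells the back reference \1 where NetBeans uses $1.
wrapped = re.sub(r'(<a.*>)', r'<li>\1</li>', html)

print(wrapped)
```

Note that the trailing pipe stays outside the li tags, since the group ends at the last greater-than symbol on each line.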
You can have multiple groups, and groups can be nested, which you’ll see in just a moment. You’re going to modify the patterns you just used to add the li tags in order to create a more robust navigation. First, undo the replacements you’ve just made. Usually something like Ctrl+Z works just fine, but if it doesn’t, here are the search and replace patterns to revert the code:
In the Search field, type: <li>(<a.*>)</li>
In the Replace field, type: $1
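As a quick sanity check, the wrap and revert patterns should cancel each other out. The Python sketch below (again my own choice of language, using \1 where the editor uses $1) runs both on a single line from the example and compares the result with the starting text.

```python
import re

original = '<a href="endives.php" title="All About Endives">Endives</a> |'

# Wrap the anchor in li tags, as in the previous exercise.
wrapped = re.sub(r'(<a.*>)', r'<li>\1</li>', original)

# The revert pattern matches the literal <li> and </li> and keeps
# only the anchor captured in group 1.
reverted = re.sub(r'<li>(<a.*>)</li>', r'\1', wrapped)

print(reverted == original)
```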
Multiple Groupings
Alright, now let’s wrap the anchor tags in li tags complete with class and id attributes for use with CSS. To accomplish this, you’ll use the following:
Search: (<a.*>(\w+).*</a>)
Replace: <li class="navEntry" id="$2">$1</li>
As in the second example’s search pattern, <a.*> matches the opening anchor tag. You’re asking the regex engine to find a string that begins with a less-than symbol, followed by the letter a, followed by a series of zero or more characters that ends with a greater-than symbol. With \w+ you are also asking the engine to look for a sequence of one or more word characters (letters, digits, and underscores), so the match stops at the first whitespace or symbol character. The parentheses around \w+ indicate you want to store the match as a group. Next you added .* to the pattern to match any other characters that may appear before the closing anchor tag. The result is that $1 will have the matched anchor string, and $2 will have the first word of the link’s text.
Breaking down the replacement, you begin the opening li tag with a class attribute and its value, followed by the id attribute. Instead of providing an id value, however, you have $2. This tells the regex engine you want to use the content stored in the second group from the search pattern, which in this case is whatever \w+ matched. Then you close the opening li tag, tell the regex engine you want to use the first grouping ($1 is the entire anchor tag), and finally add the closing li tag.
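The two groups can be checked with a short Python sketch. Python is my own choice of language, and its re module writes the back references as \1 and \2 where NetBeans uses $1 and $2. The markup is abbreviated to two links, one of which has link text of more than one word.

```python
import re

html = '''<ul id="navigation">
<a href="divebomb.php" title="All About Divebombs">Divebombs</a> |
<a href="indivisible.php" title="Indivisible by Zero">Indivisible Numbers</a> |
</ul>'''

# Group 1 is the whole anchor; group 2 is the first word of the link text.
pattern = r'(<a.*>(\w+).*</a>)'
replacement = r'<li class="navEntry" id="\2">\1</li>'

result = re.sub(pattern, replacement, html)
print(result)
```

For the second link, group 2 holds only “Indivisible” because \w+ stops at the space before “Numbers”.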
Be careful when you are determining which groups to replace. Consider the following hypothetical example (I’ve used group names instead of patterns to illustrate how grouping works):
(group1(group2))(group3)
Using the above gives you the following results:
$1 = group1group2
$2 = group2
$3 = group3
$1 contains both group1 and group2 because its parentheses enclose both of them. This is true even though group2 is a group by itself. And then of course group3 is a group by itself.
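The hypothetical pattern happens to be a valid regex as written, so the numbering can be verified directly. This Python sketch (my own choice of language) matches it against the literal text group1group2group3 and lists what each group captured; groups are numbered by the position of their opening parenthesis, counting from the left.

```python
import re

# The group names double as literal text to match.
match = re.fullmatch(r'(group1(group2))(group3)', 'group1group2group3')

# groups() returns the captures in order: group 1, group 2, group 3.
groups = match.groups()
print(groups)  # ('group1group2', 'group2', 'group3')
```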
To finish cleaning things up, you can remove the non-breaking space entities and the pipe character from the end of the lines and replace them with an empty string (the pipe needs to be preceded by a backslash in the expression because it has special meaning to the engine).
Search: &nbsp;\|
Leave the Replace field empty.
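To illustrate why the backslash matters, here is a Python sketch (my own choice of language) that assumes each line ends with a literal &nbsp; entity followed by a pipe, as described above. The sample line and variable names are only illustrative.

```python
import re

line = ('<li class="navEntry" id="Endives">'
        '<a href="endives.php" title="All About Endives">Endives</a></li>&nbsp;|')

# Escaped: \| matches a literal pipe, so the entity and the pipe are removed.
cleaned = re.sub(r'&nbsp;\|', '', line)

# Unescaped: | means "or", so this pattern reads "&nbsp; or nothing".
# The entity is removed but the pipe itself is left behind.
unescaped = re.sub(r'&nbsp;|', '', line)

print(cleaned)
print(unescaped)
```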
Your code should now look like this – a nice, neat, well-structured list that you can style with CSS:
<ul id="navigation">
<li class="navEntry" id="Divebombs"><a href="divebomb.php" title="All About Divebombs">Divebombs</a></li>
<li class="navEntry" id="Endives"><a href="endives.php" title="All About Endives">Endives</a></li>
<li class="navEntry" id="Indivisible"><a href="indivisible.php" title="Indivisible by Zero">Indivisible Numbers</a></li>
<li class="navEntry" id="Divison"><a href="division.php" title="All About Division">Divison</a></li>
<li class="navEntry" id="Skydiving"><a href="skydiving.php" title="All About Skydiving">Skydiving</a></li>
</ul>
Summary
Thanks for taking some time to learn a little bit more about regular expressions and practicing with them using search and replace. I encourage anyone who is struggling to grasp the concepts to practice using search and replace in their editor because it’s convenient and generally provides immediate visual feedback. If necessary, you can copy and paste the content you’re working with into a blank file and experiment with it, running replacements and undoing them, until you get what you like.