Practicing Regular Expressions with Search and Replace

    Chris Roberts

    If you’re just starting out with regular expressions (regex), the syntax can seem a bit puzzling at first (I would recommend Jason Pasnikowski’s article as a good starting point). One of the things that makes regex difficult to grasp in the beginning is how rarely you get a chance to use it in your code, which in turn limits the amount of practice you get. Professionals in any capacity, be it sports, entertainment, or development, always practice; some practice more than others.

    So how can you practice using regex if you are limited to just using them in your code? The answer is to use one of the many utilities that support regex for search and replace. I’m sure everyone is familiar with the standard “find x and replace it with y” type of search and replace. Most IDEs and text editors have built-in regex engines to handle search and replace. In this article I’d like to walk through a series of exercises to help you practice using regex.

    I’ll be using NetBeans for this article. Some editors might have slightly different regex behavior than what you see here, so if you’re using something other than NetBeans and it doesn’t work quite as you’d expect, be sure to read the documentation for your specific editor.

    Word Boundaries

    Let’s use the following code to start with for our examples; I’ve crafted it specifically to illustrate particular caveats of search and replace as you progress.

    <div id="navigation">
     <a href="divebomb.php" title="All About Divebombs">Divebombs</a>&nbsp;&nbsp;|&nbsp;&nbsp;
     <a href="endives.php" title="All About Endives">Endives</a>&nbsp;&nbsp;|&nbsp;&nbsp;
     <a href="indivisible.php" title="Indivisible by Zero">Indivisible Numbers</a>&nbsp;&nbsp;|&nbsp;&nbsp;
     <a href="division.php" title="All About Division">Divison</a>&nbsp;&nbsp;|&nbsp;&nbsp;
     <a href="skydiving.php" title="All About Skydiving">Skydiving</a>&nbsp;&nbsp;|&nbsp;&nbsp;
    </div>

    This navigation code should ideally be an unordered list, not free anchors inside div tags. You can’t just replace the word “div” with “ul”, however, because divebomb would become ulebomb, endives would become enules, etc. You also can’t use “<div” because it would miss the closing div tag. You can manually replace the div tags with ul tags, or you can use the special sequence \b, which denotes a word boundary.

    In the Search field, type: \bdiv\b
    In the Replace field, type: ul

    This only replaces the text “div” where it is delimited by word boundaries. Word boundaries allow you to perform whole-word-only searches, so the word “div” in <div id="navigation"> and </div> both get matched while the substrings in the anchors are left alone.
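
    As a way to try the same substitution outside an editor, here is a minimal sketch using Python’s re module. This is an assumption for illustration only: the article itself uses the NetBeans search dialog, and the shortened html sample and variable names are made up for the example.

```python
import re

# Shortened version of the navigation markup from the article.
html = '''<div id="navigation">
 <a href="divebomb.php" title="All About Divebombs">Divebombs</a>
 <a href="endives.php" title="All About Endives">Endives</a>
</div>'''

# Plain substring replacement also mangles "divebomb" and "endives".
naive = html.replace("div", "ul")

# \b anchors the match at word boundaries, so only the standalone "div" changes.
bounded = re.sub(r"\bdiv\b", "ul", html)

print(naive)
print(bounded)
```

    The naive version rewrites the file names as well, while the bounded version touches only the opening and closing tags.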

    Later you’ll also see \w, which matches a single “word” character: a letter, a digit, or an underscore.

    Groupings and Back References

    Continuing with the modified code from the first example, let’s continue refactoring the list. Right now your code should look like this:

    <ul id="navigation">
     <a href="divebomb.php" title="All About Divebombs">Divebombs</a>&nbsp;&nbsp;|&nbsp;&nbsp;
     <a href="endives.php" title="All About Endives">Endives</a>&nbsp;&nbsp;|&nbsp;&nbsp;
     <a href="indivisible.php" title="Indivisible by Zero">Indivisible Numbers</a>&nbsp;&nbsp;|&nbsp;&nbsp;
     <a href="division.php" title="All About Division">Divison</a>&nbsp;&nbsp;|&nbsp;&nbsp;
     <a href="skydiving.php" title="All About Skydiving">Skydiving</a>&nbsp;&nbsp;|&nbsp;&nbsp;
    </ul>

    You can easily do a standard search and replace on the anchor tags without any of the issues that prevented you from doing so with div, but where is the fun in that? In the spirit of practice, let’s use regex to wrap the anchors in li tags.

    To select the anchors, type the following in the Search field: (<a.*>)
    In the Replace field, type: <li>$1</li>

    Ignoring the parentheses in the search pattern for now, let’s break up the pattern and discuss each piece of it. The first piece is <a, which tells the regex engine to match a less-than symbol followed by the letter a. The next piece is .*>, which tells the engine to match any character zero or more times followed by a greater-than symbol. Because .* is greedy, the match runs to the last greater-than symbol on the line, so it covers each anchor in the block of code above from its opening tag through the closing </a>.

    The parentheses in the search pattern perform a special function: they group part of the match so you can access it later. By adding the parentheses, you are telling the regex engine to store the matched text because you’ll need it later. You can access these groups by number.

    The replace pattern tells the engine to replace the search pattern with an opening li tag, followed by the contents of the first group, and a closing li tag. In this example there is only one group (because there is only one set of parentheses), so the $1 in the middle of the li tags indicates this is the group you want to use. (Some editors may use \1 instead of $1. If $1 doesn’t work, then undo your replacement and try the other variant.)
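
    The same group-and-back-reference replacement can be sketched in Python, again as an illustration rather than what NetBeans runs internally. Note that Python’s re module is one of the tools that writes the back reference as \1 rather than $1; the line variable is just one entry copied from the list above.

```python
import re

line = '<a href="endives.php" title="All About Endives">Endives</a>&nbsp;&nbsp;|&nbsp;&nbsp;'

# The parentheses capture the anchor; \1 in the replacement inserts it again.
# The greedy .* runs to the last ">" on the line, which is the one in </a>.
wrapped = re.sub(r"(<a.*>)", r"<li>\1</li>", line)

print(wrapped)
```

    The trailing separator is not part of the match, so it stays outside the new li tags.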

    You can have multiple groups, and groups can be nested, which you’ll see in just a moment. You’re going to modify the patterns you just used to add the li tags in order to create a more robust navigation. Undo the replacements you’ve just made. Usually something like Ctrl+Z works just fine, but if it doesn’t, here are the search and replace patterns to revert the code:

    In the Search field, type: <li>(<a.*>)</li>
    In the Replace field, type: $1
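
    If you want to check the revert pattern before running it on real code, a small round trip like the following can help. It is only a sketch in Python (which uses \1 where the editor uses $1), and the sample line is an assumed copy of one wrapped entry.

```python
import re

wrapped = '<li><a href="endives.php" title="All About Endives">Endives</a></li>&nbsp;&nbsp;|&nbsp;&nbsp;'

# Match the li wrapper, capture only the anchor inside it, and keep just the capture.
reverted = re.sub(r"<li>(<a.*>)</li>", r"\1", wrapped)

print(reverted)
```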

    Multiple Groupings

    Alright, now let’s wrap the anchor tags in li tags complete with class and id attributes for use with CSS. To accomplish this, you’ll use the following:

    Search: (<a.*>(\w+).*</a>)
    Replace: <li class="navEntry" id="$2">$1</li>

    As in the second example’s search pattern, <a.*> matches the anchor tags. You’re asking the regex engine to find a string that begins with a less-than symbol, followed by the letter a, followed by a series of zero or more characters that ends with a greater-than symbol. With \w+ you are also asking the engine to look for a sequence of one or more word characters (letters, digits, or underscores), which excludes whitespace and symbols. The parentheses around \w+ indicate you want to store the match as a group. Next you added .* to the pattern to match any other characters that may appear before the closing of the anchor tag. The result is that $1 will have the matched anchor string, and $2 will have the first word of the link’s text.

    Breaking down the replacement, you begin with the opening li tag, which carries a class attribute and its value, followed by the id attribute. Instead of a literal id value, however, you have $2. This tells the regex engine to use the content stored in the second group from the search pattern, which in this case is whatever \w+ matched. After the opening li tag comes $1, the first group (the entire anchor tag), and finally the closing li tag.
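
    To see what each group captures, the two-group pattern can be run in Python as a rough check. This is a sketch under the assumption that your editor’s engine behaves like Python’s re module here; the back references are written \1 and \2 instead of $1 and $2, and the lines list is a hypothetical two-entry excerpt of the navigation.

```python
import re

lines = [
    '<a href="divebomb.php" title="All About Divebombs">Divebombs</a>&nbsp;&nbsp;|&nbsp;&nbsp;',
    '<a href="indivisible.php" title="Indivisible by Zero">Indivisible Numbers</a>&nbsp;&nbsp;|&nbsp;&nbsp;',
]

# Group 1 is the whole anchor; group 2 is the first word of the link text.
pattern = re.compile(r"(<a.*>(\w+).*</a>)")
replacement = r'<li class="navEntry" id="\2">\1</li>'

results = [pattern.sub(replacement, line) for line in lines]

for result in results:
    print(result)
```

    For the second entry, \w+ stops at the space, so the id becomes Indivisible while the full link text stays inside the anchor.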

    Be careful when you are determining which groups to replace. Consider the following hypothetical example (I’ve used group names instead of patterns to illustrate how grouping works):

    (group1(group2))(group3)

    Using the above gives you the following results:

    $1 = group1group2
    $2 = group2
    $3 = group3

    $1 contains both group1 and group2 because parentheses enclose both of them. This is true even though group2 is a group by itself. And then of course group3 is a group to itself.
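
    The numbering rule (groups are counted by their opening parenthesis, left to right) can be confirmed with a short Python sketch. The literal words group1, group2, and group3 stand in for real patterns, just as in the hypothetical example above.

```python
import re

# The pattern is the hypothetical example itself, matched against its own text.
match = re.fullmatch(r"(group1(group2))(group3)", "group1group2group3")

# groups() returns the captures in order: group 1, group 2, group 3.
captures = match.groups()

print(captures)
```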

    To finish cleaning things up, you can remove the non-breaking space entities and the pipe character from the end of the lines and replace them with an empty string (the pipe needs to be preceded by a backslash in the expression because it has special meaning to the engine).

    Search: &nbsp;&nbsp;\|&nbsp;&nbsp;
    Leave the Replace field empty.
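
    To illustrate why the backslash before the pipe matters, here is a small Python comparison. It is a sketch with an assumed sample line; the point is that an unescaped pipe means “or”, so the pattern only matches the entities on either side and leaves the pipe character behind.

```python
import re

line = '<a href="endives.php">Endives</a>&nbsp;&nbsp;|&nbsp;&nbsp;'

# Unescaped: "&nbsp;&nbsp;" OR "&nbsp;&nbsp;", so the literal pipe survives.
unescaped = re.sub(r"&nbsp;&nbsp;|&nbsp;&nbsp;", "", line)

# Escaped: the pipe is matched literally and the whole separator is removed.
escaped = re.sub(r"&nbsp;&nbsp;\|&nbsp;&nbsp;", "", line)

print(unescaped)
print(escaped)
```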

    Your code should now look like this: a nice, neat, well-structured list you can style with CSS.

    <ul id="navigation">
     <li class="navEntry" id="Divebombs"><a href="divebomb.php" title="All About Divebombs">Divebombs</a></li>
     <li class="navEntry" id="Endives"><a href="endives.php" title="All About Endives">Endives</a></li>
     <li class="navEntry" id="Indivisible"><a href="indivisible.php" title="Indivisible by Zero">Indivisible Numbers</a></li>
     <li class="navEntry" id="Divison"><a href="division.php" title="All About Division">Divison</a></li>
     <li class="navEntry" id="Skydiving"><a href="skydiving.php" title="All About Skydiving">Skydiving</a></li>
    </ul>

    Summary

    Thanks for taking some time to learn a little bit more about regular expressions and practicing with them using search and replace. I encourage anyone who is struggling to grasp the concepts to practice using search and replace in their editor because it’s convenient and generally provides immediate visual feedback. If necessary, you can copy and paste the content you’re working with into a blank file and experiment with it, running replacements and undoing them, until you get what you like.