The Java Regex API Explained

It was a long time coming, but the java.util.regex package was a significant and hugely useful addition to Java 1.4. For Web developers constantly dealing with text-based content, this represents a major productivity and efficiency boost. Java regular expressions can be used in client-side Java applets and also in server-side J2EE and JSP code.

Using regular expressions and the regex package, you can easily describe, locate and manipulate complex patterns of text. Trust me, this is definitely a "How did I ever get by without it?" kind of thing.

In this article I’ll cover the general idea behind regular expressions, explain how the java.util.regex package works, then wrap up with a quick look at how the String class has been retrofitted to take advantage of regular expressions.

Before we get into the details of the Java regex API itself, let’s have a quick look at what a regular expression, or, to those in the trade, a ‘regex’, actually is. If you already know what a regular expression is, feel free to skim over this next section.

What is a Regular Expression?

A regular expression is a series of metacharacters and literals that allow you to describe substrings in text using a pattern. These metacharacters actually form a miniature language in their own right. In fact, in many ways, you can think of regular expressions as a kind of SQL query for free flowing text. Consider the following sentence:

My name is Will and I live in williamstown.

How could we find all occurrences of the text ‘Will’, regardless of whether or not an upper or lowercase ‘w’ was used? With regular expressions you can describe this requirement by composing a pattern made from a series of metacharacters and literals. Here is such a pattern:

[Ww]ill

This one’s pretty straightforward. The interesting part is the [Ww] character class: it indicates that any one of the letters enclosed within the brackets (in this case, either an uppercase ‘W’ or a lowercase ‘w’) is acceptable. So, this regular expression will match text that begins with an uppercase or lowercase w, and is followed by the literals i, then l, and then another l.

Let’s step it up a notch. The above regular expression will actually match 2 occurrences of will — the name Will and the first 4 characters of text in williamstown. We may only have wanted to search for will and Will, and not for words that simply contain these 4 characters in sequence. Here’s an improved version:

\b[Ww]ill\b

The \b is how we describe a word boundary. A word boundary matches a position rather than a character: the point where a word character sits next to a non-word character (such as a space, tab or punctuation mark) or next to the beginning or end of the text. This effectively rules out williamstown as a match because the second l in williamstown is not followed by a word boundary; it’s followed by an i.
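
As a rough illustration of the difference between the two patterns, here is a minimal Java sketch. The class name WordBoundaryDemo and the helper findAll are made up for this example, and the doubled backslashes are the Java string escaping covered later in this article.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the java.util.regex API.
class WordBoundaryDemo {
    // Returns every match of regex in text, joined with ", " in order of appearance.
    static String findAll(String regex, String text) {
        StringBuilder found = new StringBuilder();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            if (found.length() > 0) {
                found.append(", ");
            }
            found.append(matcher.group());
        }
        return found.toString();
    }

    public static void main(String[] args) {
        String sentence = "My name is Will and I live in williamstown.";
        System.out.println(findAll("[Ww]ill", sentence));       // Will, will
        System.out.println(findAll("\\b[Ww]ill\\b", sentence)); // Will
    }
}
```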

I could dedicate a whole article to the fine art of crafting regular expressions, but my focus here is on the Java regular expression package itself. So, let’s examine one more regular expression — we’ll stick with this one throughout the rest of the article.

  (\w+)@(\w+\.)(\w+)(\.\w+)*

Let’s take a divide-and-conquer approach to analyzing this pattern. The (\w+) grouping (it appears twice; examine the one at the start) looks for word characters, as denoted by the \w. The + indicates that one or more word characters must appear (not necessarily the same one). This must be followed by a literal @ character. The parentheses are not actually required here, but they do divide the expression into groupings, and you’ll soon see that forming logical groupings in this manner can be extremely useful.

Based on this first portion of our example regex, the (\w+)@ portion, here are a few examples that meet the requirements so far:

  billy@
  joe@
  francisfordcoppola@

Let’s move along to the next portion. The (\w+\.) grouping is similar, but expects a period to follow in order to make a match. The period has been escaped using a backslash because the period character is itself a regex metacharacter (a wildcard that matches any character). You must always escape metacharacters in this way if you want to match on their literal meaning.

Let’s take a look at a few examples that would meet the requirements so far:

  billy@webworld.
  joe@optus.
  francisfordcoppola@myisp.

The (\w+) grouping is identical to the first grouping: it looks for one or more word characters. So, as you’ve no doubt realised already, our regular expression is intended to match email addresses.

A few examples that meet the requirements so far:

  billy@webworld.com
  joe@optus.net
  francisfordcoppola@myisp.com

We’re nearly there. The (\.\w+)* grouping should mostly make sense at this point: we’re looking for a period followed by one or more word characters. But what’s with the * after the closing parenthesis? In the world of regular expressions, we use * to denote that the preceding metacharacter, literal or group can occur zero or more times. As an example, \w\d* would match a word character followed by zero or more digits. In our example, we use parentheses to group together a series of metacharacters, so the * applies to the whole group. So, you can interpret (\.\w+)* as ‘match a period followed by one or more word characters, and match that combination zero or more times’.

A few examples that meet the requirements of the complete regular expression:

  fred@vianet.com
  barney@comcorp.net.au
  wilma@mjinteractive.iinet.net.au
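
To check these examples against the complete pattern, something along the following lines should work. It is only a sketch: the class name EmailPatternDemo and the method looksLikeEmail are invented here, and it relies on String.matches(), which is described near the end of this article. Bear in mind that the pattern is a teaching example and not a thorough email validator.

```java
// Hypothetical demo class; the names are illustrative only.
class EmailPatternDemo {
    // The article's email pattern, written with Java string escaping.
    static final String EMAIL_REGEX = "(\\w+)@(\\w+\\.)(\\w+)(\\.\\w+)*";

    // True only when the whole candidate string fits the pattern.
    static boolean looksLikeEmail(String candidate) {
        return candidate.matches(EMAIL_REGEX);
    }

    public static void main(String[] args) {
        System.out.println(looksLikeEmail("fred@vianet.com"));                  // true
        System.out.println(looksLikeEmail("barney@comcorp.net.au"));            // true
        System.out.println(looksLikeEmail("wilma@mjinteractive.iinet.net.au")); // true
        System.out.println(looksLikeEmail("billy@webworld."));                  // false
    }
}
```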

With our regular expression crafted, it’s time to move on to the Java side of things. The very first thing you will need to know is how to combat the rather unfortunate syntax clash between Java strings and regular expressions. It’s a clash that you, the developer, must deal with.

Java Safe Regular Expressions

It’s slightly annoying, but the fact remains that you will need to make your regular expressions safe for use in Java code. This means that any backslash delimited metacharacters will need to be escaped. This is because the backslash character has its own special meaning in Java. So, our example email address regex would have to be rewritten as follows:

  String emailRegEx = "(\\w+)@(\\w+\\.)(\\w+)(\\.\\w+)*";

Keep in mind that if you actually need to match against a literal backslash, you must double up yet again. It can be more difficult to read a Java safe regex, so you may want first to craft a ‘regular’ regular expression (a regregex perhaps?) and keep a copy handy — perhaps inside a code comment.
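
To make the 'double up yet again' point concrete: the regex for one literal backslash is two backslashes, and each of those must itself be escaped in Java source, giving four. The sketch below assumes a made-up class BackslashDemo and helper toForwardSlashes purely for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class BackslashDemo {
    // Regex: \\ (one escaped backslash). Java source: four backslashes.
    static final Pattern BACKSLASH = Pattern.compile("\\\\");

    // Replaces every backslash in the path with a forward slash.
    static String toForwardSlashes(String path) {
        return BACKSLASH.matcher(path).replaceAll("/");
    }

    public static void main(String[] args) {
        String windowsPath = "C:\\temp\\notes.txt"; // the text C:\temp\notes.txt
        System.out.println(toForwardSlashes(windowsPath)); // C:/temp/notes.txt
    }
}
```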

So, how do we use all this to achieve something useful? In certain situations, you can simply call methods such as replaceFirst() and replaceAll() directly on the String class; we’ll take a quick look at this approach later. However, for more sophisticated regex operations, you will be far better served by taking a more object-oriented approach.

The Pattern Class

Here’s something refreshing: the java.util.regex package only contains three classes — and one of those is an exception! As you would expect, this makes for a very easy-to-learn API. Here are the 3 steps you would generally follow to use the regex package:

  1. Compile your regex string using the Pattern class.
  2. Use the Pattern class to get a Matcher object.
  3. Call methods on the Matcher to get at any matches.

We will look at the Matcher class next, but let’s dive in with a look at the Pattern class. This class lets you compile your regular expression — this effectively optimises it for efficiency and use by multiple target strings (strings which you want to test the compiled regular expression against). Consider the following example:

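    // Requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
    String emailRegEx = "(\\w+)@(\\w+\\.)(\\w+)(\\.\\w+)*";
    // The text we want to search (the 'target' string).
    String targetString = "Contact fred@vianet.com for details";
    // Compile and get a reference to a Pattern object.
    Pattern pattern = Pattern.compile(emailRegEx);
    // Get a matcher object for the target string - we cover this next.
    Matcher matcher = pattern.matcher(targetString);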

Take note that the Pattern object was retrieved via the Pattern class’s static compile method — you cannot instantiate a Pattern object using new. Once you have a Pattern object you can use it to get a reference to a Matcher object. We look at Matcher next.
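
The benefit of compiling once shows up when the same pattern is applied to many target strings. The following is a small sketch of that idea; ReusePatternDemo and countContainingEmail are hypothetical names chosen for this example.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class ReusePatternDemo {
    // Compiled once and then shared by every call below.
    static final Pattern EMAIL = Pattern.compile("(\\w+)@(\\w+\\.)(\\w+)(\\.\\w+)*");

    // Counts how many of the given lines contain at least one match.
    static int countContainingEmail(String[] lines) {
        int count = 0;
        for (String line : lines) {
            // Each target string gets its own Matcher from the shared Pattern.
            if (EMAIL.matcher(line).find()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String[] lines = {
            "Contact fred@vianet.com today",
            "No address here",
            "Try barney@comcorp.net.au instead"
        };
        System.out.println(countContainingEmail(lines)); // 2
    }
}
```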

The Matcher Class

Earlier, I suggested that regular expressions are a kind of SQL query for free flowing text. The analogy is not entirely perfect, but when using the regex API it can help to think along these lines. If you think of Pattern.compile(myRegEx) as being a kind of JDBC PreparedStatement, then you can think of the Pattern class’s matcher(targetString) method as a kind of SQL SELECT statement. Study the following code:

   // Compile the regex.
   String regex = "(\\w+)@(\\w+\\.)(\\w+)(\\.\\w+)*";
   Pattern pattern = Pattern.compile(regex);
   // Create the 'target' string we wish to interrogate.
   String targetString = "You can email me at g_andy@example.com or andy@example.net to get more info";
   // Get a Matcher based on the target string.
   Matcher matcher = pattern.matcher(targetString);

   // Find all the matches.
   while (matcher.find()) {
     System.out.println("Found a match: " + matcher.group());
     System.out.println("Start position: " + matcher.start());
     System.out.println("End position: " + matcher.end());
   }

There are a few interesting things going on here. First up, notice that we used the Pattern class’s matcher() method to obtain a Matcher object. This object, still using our SQL analogy, is where the resulting matches are held — think JDBC ResultSet. The records, of course, are the portions of text that matched our regular expression.

The while loop runs conditionally based on the results of the Matcher class’s find() method. This method will scan just enough of our target string to make a match, at which point it will return true; when no further match exists it returns false and the loop ends. Be careful: calling group(), start() or end() before find() has succeeded will result in the unchecked IllegalStateException being thrown at runtime.
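
The IllegalStateException warning is easy to verify for yourself. Here is a minimal sketch, assuming a made-up class MatcherStateDemo, that shows both the failure and the safe pattern of checking find() first.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class MatcherStateDemo {
    // Returns true if calling group() before any match attempt throws.
    static boolean groupBeforeFindThrows() {
        Matcher matcher = Pattern.compile("\\d+").matcher("room 101");
        try {
            matcher.group(); // no match has been attempted yet
            return false;
        } catch (IllegalStateException expected) {
            return true;
        }
    }

    // Safe usage: only read the match after find() has returned true.
    static String firstNumber(String text) {
        Matcher matcher = Pattern.compile("\\d+").matcher(text);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(groupBeforeFindThrows());  // true
        System.out.println(firstNumber("room 101"));  // 101
    }
}
```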

In the body of our while loop we retrieved the matched substring using the Matcher class’s group() method. Our while loop executes twice: once for each email address in our target string. On each occasion, it prints the matched email address, returned by the group() method, and the substring location information. Take a look at the output:

Found a match: g_andy@example.com 
Start position: 20
End position: 38
Found a match: andy@example.net
Start position: 42
End position: 58

As you can see, it was simply a matter of using the Matcher’s start() and end() methods to find out where the matched substrings occurred in the target string. Next up, a closer look at the group() method.

Understanding Groups

As you learned, Matcher.group() will retrieve a complete match from the target string. But what if you were also interested in subsections, or ‘subgroups’ of the matched text? In our email example, it may have been desirable to extract the host name portion of the email address and the username portion. Have a look at a revised version of our Matcher driven while loop:

   while (matcher.find()) {
     System.out.println("Found a match: " + matcher.group(0) +
                        ". The Username is " +
                        matcher.group(1) + " and the ISP is " +
                        matcher.group(2));
   }

As you may recall, groups are represented as a set of parentheses wrapped around a subsection of your pattern. The first group, located using Matcher.group() or, as in the example, the more specific Matcher.group(0), represents the entire match. Further groups can be found using the same group(int index) method. Here is the output for the above example:

Found a match: g_andy@example.com. The Username is g_andy and the ISP is example.
Found a match: andy@example.net. The Username is andy and the ISP is example.

As you can see, group(1) retrieves the username portion of the email address and group(2) retrieves the ISP portion. When crafting your own regular expressions it is, of course, up to you how you logically subgroup your patterns. A minor oversight in this example is that the period itself is captured as part of the subgroup returned by group(2)!

Keep in mind that subgroups are indexed from left to right based on the order of their opening parentheses. This is particularly important when you are working with groups that are nested within other groups.
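
A short sketch may help with nested groups. The pattern below is a simplified variation invented for this example (it is not the article's email regex), and the class name NestedGroupsDemo is hypothetical. Count the opening parentheses from left to right to predict each index.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the pattern is a simplified illustration.
class NestedGroupsDemo {
    // Opening parentheses, left to right:
    //   1: whole address   2: username   3: host   4: host name   5: suffix
    static final Pattern NESTED = Pattern.compile("((\\w+)@((\\w+)\\.(\\w+)))");

    // Returns groups 0..groupCount() for an exact match, or null if none.
    static String[] groupsOf(String address) {
        Matcher matcher = NESTED.matcher(address);
        if (!matcher.matches()) {
            return null;
        }
        String[] groups = new String[matcher.groupCount() + 1];
        for (int i = 0; i <= matcher.groupCount(); i++) {
            groups[i] = matcher.group(i);
        }
        return groups;
    }

    public static void main(String[] args) {
        String[] groups = groupsOf("andy@example.net");
        for (int i = 0; i < groups.length; i++) {
            System.out.println(i + ": " + groups[i]);
        }
        // 0: andy@example.net
        // 1: andy@example.net
        // 2: andy
        // 3: example.net
        // 4: example
        // 5: net
    }
}
```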

A Little More on the Pattern and Matcher Classes

That’s pretty much the core of this very small, yet very capable, Java API. However, there are a few other bits and pieces you should look into once you’ve had a chance to experiment with the basics. The Pattern class has a number of flags that you can use as a second argument to its compile() method. For example, you can use Pattern.CASE_INSENSITIVE to tell the regex engine to match ASCII characters regardless of case.

Pattern.MULTILINE is another useful one. You will sometimes want to tell the regex engine that your target string is not a single line of text; rather, it contains several lines that have their own termination characters. With this flag set, the ^ and $ anchors match at the start and end of each line instead of only at the start and end of the whole string.

If you need to, you can combine multiple flags by using the Java | (bitwise OR) operator. For instance, if you wanted to compile a regex with multiline and case insensitivity support, you could do the following:

Pattern.compile(myRegEx, Pattern.CASE_INSENSITIVE | Pattern.MULTILINE );
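
To see what each flag contributes, the sketch below counts matches of the same pattern compiled three ways. The class FlagsDemo, the helper countMatches and the sample LOG text are all invented for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names and sample text are illustrative only.
class FlagsDemo {
    static final String LOG =
        "Error: disk full\nwarning: low memory\nERROR: timeout\nan error occurred";

    // Counts how many times the pattern matches within the text.
    static int countMatches(Pattern pattern, String text) {
        Matcher matcher = pattern.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        Pattern noFlags = Pattern.compile("^error");
        Pattern caseOnly = Pattern.compile("^error", Pattern.CASE_INSENSITIVE);
        Pattern both = Pattern.compile("^error",
                Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);

        System.out.println(countMatches(noFlags, LOG));  // 0: case differs
        System.out.println(countMatches(caseOnly, LOG)); // 1: ^ is start of input only
        System.out.println(countMatches(both, LOG));     // 2: ^ matches at each line start
    }
}
```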

The Matcher class has a number of interesting methods, too: String replaceAll(String replacementString) and String replaceFirst(String replacementString), in particular, are worth a mention here.

The replaceAll() method takes a replacement string and replaces all matches with it. The replaceFirst() method is very similar but will -- you guessed it -- replace only the first occurrence of a match. Have a look at the following code:

   // Matches 'BBC' followed by a single digit.
   String thePattern = "bbc\\d";
   // Compile regex and switch off case sensitivity.
   Pattern pattern = Pattern.compile(thePattern, Pattern.CASE_INSENSITIVE);
   // The target string.
   String target = "I like to watch bBC1 and BbC2 - I suppose ITV is okay too";
   // Get the Matcher for the target string.
   Matcher matcher = pattern.matcher(target);
   // Blot out all references to the BBC.
   System.out.println(matcher.replaceAll("xxxx") );

Here's the output:

I like to watch xxxx and xxxx - I suppose ITV is okay too
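
For comparison, here is a sketch of replaceFirst() run against the same target string. The class name ReplaceFirstDemo and the method blotFirst are made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class ReplaceFirstDemo {
    static final Pattern BBC = Pattern.compile("bbc\\d", Pattern.CASE_INSENSITIVE);

    // Replaces only the first match and leaves later ones untouched.
    static String blotFirst(String target) {
        Matcher matcher = BBC.matcher(target);
        return matcher.replaceFirst("xxxx");
    }

    public static void main(String[] args) {
        String target = "I like to watch bBC1 and BbC2 - I suppose ITV is okay too";
        System.out.println(blotFirst(target));
        // I like to watch xxxx and BbC2 - I suppose ITV is okay too
    }
}
```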

Backreferences

It's worth taking a quick look at another important regex topic: backreferences. Backreferences allow you to access captured subgroups while the regex engine is executing. Basically, this means that you can refer to a subgroup from an earlier part of a match later on in the pattern. Imagine that you needed to inspect a target string for 3-letter words that started and ended with the same letter -- wow, sos, mum, that kind of thing. Here's a pattern that will do the job:

(\w)(\w)(\1)

In this case, the (\1) group contains a backreference to the first match made in the pattern. Basically, the third parenthesised group will only match when the character at this position is the same as the character in the first parenthesised group. Of course, you would simply substitute \1 with \2 if you wanted to backreference the second group. It's simple, but in many cases, tremendously useful.
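
Here is a quick way to try the backreference pattern from Java. This is only a sketch: the class BackreferenceDemo and the method startsAndEndsAlike are hypothetical, and the test is applied to one whole word at a time using String.matches().

```java
// Hypothetical demo class; the names are illustrative only.
class BackreferenceDemo {
    // The pattern from the text, with Java string escaping applied.
    static final String SAME_ENDS = "(\\w)(\\w)(\\1)";

    // True when the word is exactly three word characters and the
    // third is the same character as the first.
    static boolean startsAndEndsAlike(String word) {
        return word.matches(SAME_ENDS);
    }

    public static void main(String[] args) {
        System.out.println(startsAndEndsAlike("wow")); // true
        System.out.println(startsAndEndsAlike("sos")); // true
        System.out.println(startsAndEndsAlike("mum")); // true
        System.out.println(startsAndEndsAlike("cat")); // false
    }
}
```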

The Matcher object's replacement methods (and the String class's counterparts) also support a notation for doing backreferences in the replacement string. It works in the same way, but uses a dollar sign instead of a backslash. So, matcher.replaceAll("$2") would replace all matches in a target string with the value matched by the second subgroup of the regular expression.
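
As a small illustration of the dollar-sign notation, the sketch below replaces each email address with just its username, which is group 1 of the email pattern used throughout this article. The class ReplacementGroupDemo and the method usernamesOnly are invented names.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class ReplacementGroupDemo {
    static final Pattern EMAIL = Pattern.compile("(\\w+)@(\\w+\\.)(\\w+)(\\.\\w+)*");

    // Replaces every matched address with the text captured by group 1.
    static String usernamesOnly(String text) {
        return EMAIL.matcher(text).replaceAll("$1");
    }

    public static void main(String[] args) {
        String text = "Mail g_andy@example.com or andy@example.net";
        System.out.println(usernamesOnly(text)); // Mail g_andy or andy
    }
}
```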

String Class RegEx Methods

As I mentioned earlier, the Java String class has been updated to take advantage of regular expressions. You can, in simple cases, completely bypass using the regex API directly by calling regex enabled methods directly on the String class. There are 5 such methods available.

You can use the boolean matches(String regex) method to quickly determine if a string exactly matches a particular pattern. The appropriately named String replaceFirst(String regex, String replacement) and String replaceAll(String regex, String replacement) methods allow you to do quick and dirty text replacements. And finally, the String[] split(String regEx) and String[] split(String regEx, int limit) methods let you split a string into substrings based on a regular expression. These last two methods are, in concept, similar to the java.util.StringTokenizer, only much more powerful.
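
The two split() variants can be sketched as follows. The separator pattern, the sample text and the class name StringSplitDemo are assumptions made up for this example.

```java
import java.util.Arrays;

// Hypothetical demo class; the names and sample text are illustrative only.
class StringSplitDemo {
    // A comma or semicolon, with optional whitespace on either side.
    static final String SEPARATOR = "\\s*[,;]\\s*";

    static String[] splitAll(String text) {
        return text.split(SEPARATOR);
    }

    // With a limit of n, the result has at most n elements and the
    // last element holds the unsplit remainder of the text.
    static String[] splitAtMost(String text, int limit) {
        return text.split(SEPARATOR, limit);
    }

    public static void main(String[] args) {
        String fruit = "apples, oranges;pears  ,  kiwis";
        System.out.println(Arrays.toString(splitAll(fruit)));
        // [apples, oranges, pears, kiwis]
        System.out.println(splitAtMost(fruit, 2).length); // 2
    }
}
```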

Keep in mind that it makes much more sense, in many cases, to use the regex API and a more object oriented approach. One reason for this is that such an approach allows you to precompile your regular expression and then use it across multiple target strings. Another reason is that it is simply much more capable. You will quickly get the hang of when to choose one approach over the other.

Hopefully, I have given you a head start with the regex API and tempted those who are yet to discover this powerful tool to give it some serious consideration. A quick tip: don't waste hours of precious development time trying to craft a complicated regular expression -- it may already exist. There are plenty of places, such as www.regexlib.com, that make a whole bunch of them freely available.
