It was a long time coming, but the java.util.regex package was a significant and hugely useful addition to Java 1.4. For Web developers constantly dealing with text-based content, this represents a major productivity and efficiency boost. Java regular expressions can be used in client-side Java applets and also in server-side J2EE and JSP code.
Using regular expressions and the regex package, you can easily describe, locate and manipulate complex patterns of text. Trust me, this is definitely a “How did I ever get by without it?” kind of thing.
In this article I’ll explain the general idea behind regular expressions, explain how the java.util.regex package works, then wrap up with a quick look at how the String class has been retrofitted to take advantage of regular expressions.
Before we get into the details of the Java regex API itself, let’s have a quick look at what a regular expression, or, to those in the trade, a ‘regex’, actually is. If you already know what a regular expression is, feel free to skim over this next section.
Key Takeaways
- The `java.util.regex` package significantly enhances text manipulation capabilities in Java, making it indispensable for web developers.
- Regular expressions allow for the description, identification, and alteration of complex text patterns, which can be used effectively in both client-side and server-side Java applications.
- Key classes in the Java Regex API include `Pattern` for compiling regex strings and `Matcher` for performing match operations on text.
- Java strings require special handling in regex due to the escape character, necessitating double backslashes (\\) in regex patterns.
- Practical applications of regex in Java extend to string replacement and validation, simplifying tasks such as email validation and data parsing.
What is a Regular Expression?
A regular expression is a series of metacharacters and literals that allow you to describe substrings in text using a pattern. These metacharacters actually form a miniature language in their own right. In fact, in many ways, you can think of regular expressions as a kind of SQL query for free flowing text. Consider the following sentence:
My name is Will and I live in williamstown.
How could we find all occurrences of the text ‘Will’, regardless of whether or not an upper or lowercase ‘w’ was used? With regular expressions you can describe this requirement by composing a pattern made from a series of metacharacters and literals. Here is such a pattern:
[Ww]ill
This one’s pretty straightforward. The interesting part is the `[Ww]` character class: it indicates that any one of the letters enclosed within the brackets (in this case, either an uppercase ‘W’ or a lowercase ‘w’) is acceptable. So, this regular expression will match text that begins with an uppercase or lowercase `w`, and is followed by the literals `i`, then `l`, and then another `l`.
Let’s step it up a notch. The above regular expression will actually match 2 occurrences of `will`: the name `Will` and the first 4 characters of text in `williamstown`. We may only have wanted to search for `will` and `Will`, and not for words that simply contain these 4 characters in sequence. Here’s an improved version:
\b[Ww]ill\b
The `\b` is how we describe a word boundary. A word boundary is a zero-width position between a word character and something that is not a word character, such as a space, a tab, a punctuation mark, or the beginning or end of a line. This effectively rules out `williamstown` as a match because the second `l` in `williamstown` is not followed by a word boundary; it’s followed by an `i`.
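To see the difference between the two patterns in practice, here is a minimal sketch in Java. The class name `WordBoundaryDemo` and the helper `findAll` are hypothetical names chosen for illustration. The backslashes are doubled because the pattern sits inside a Java string literal, a detail covered later in this article.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordBoundaryDemo {

    // Collects every substring of 'text' that matches 'regex' (hypothetical helper).
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<String>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            hits.add(matcher.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String sentence = "My name is Will and I live in williamstown.";

        // Without boundaries, the start of 'williamstown' also matches.
        System.out.println(findAll("[Ww]ill", sentence));       // prints [Will, will]

        // With \b on both sides, only the standalone word matches.
        System.out.println(findAll("\\b[Ww]ill\\b", sentence)); // prints [Will]
    }
}
```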
I could dedicate a whole article to the fine art of crafting regular expressions, but my focus here is on the Java regular expression package itself. So, let’s examine one more regular expression — we’ll stick with this one throughout the rest of the article.
(\w+)@(\w+\.)(\w+)(\.\w+)*
Let’s take a divide-and-conquer approach to analyzing this pattern. The `(\w+)` grouping (it appears twice; examine the one at the start) looks for word characters (letters, digits and the underscore), as denoted by the `\w`. The `+` indicates that one or more word characters must appear (not necessarily the same one). This must be followed by a literal `@` character. The parentheses are not actually required here, but they do divide the expression into groupings, and you’ll soon see that forming logical groupings in this manner can be extremely useful.
Based on this first portion of our example regex, the `(\w+)@` portion, here are a few examples that meet the requirements so far:
billy@
joe@
francisfordcoppola@
Let’s move along to the next portion. The `(\w+\.)` grouping is similar, but expects a period to follow in order to make a match. The period has been escaped using a backslash because the period character is itself a regex metacharacter (a wildcard that matches any character). You must always escape metacharacters in this way if you want to match on their literal meaning.
Let’s take a look at a few examples that would meet the requirements so far:
billy@webworld.
joe@optus.
francisfordcoppola@myisp.
The `(\w+)` grouping is identical to the first grouping: it looks for one or more word characters. So, as you’ve no doubt realised already, our regular expression is intended to match email addresses.
A few examples that meet the requirements so far:
billy@webworld.com
joe@optus.net
francisfordcoppola@myisp.com
We’re nearly there. The `(\.\w+)*` grouping should mostly make sense at this point: we’re looking for a period followed by one or more word characters. But what’s with the `*` after the closing parenthesis? In the world of regular expressions, we use `*` to denote that the preceding metacharacter, literal or group can occur zero or more times. As an example, `\w\d*` would match a word character followed by zero or more digits. In our example, we use parentheses to group together a series of metacharacters, so the `*` applies to the whole group. So, you can interpret `(\.\w+)*` as ‘match a period followed by one or more word characters, and match that combination zero or more times’.
A few examples that meet the requirements of the complete regular expression:
fred@vianet.com
barney@comcorp.net.au
wilma@mjinteractive.iinet.net.au
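As a quick sanity check, the following sketch runs the complete pattern against the addresses above plus one deliberately incomplete address. `EmailPatternCheck` and `isWholeMatch` are hypothetical names used only for this illustration, and the doubled backslashes are what Java string literals require, as the next section explains.

```java
import java.util.regex.Pattern;

class EmailPatternCheck {

    // The article's email pattern. The backslashes are doubled because
    // this is a Java string literal.
    static final Pattern EMAIL =
            Pattern.compile("(\\w+)@(\\w+\\.)(\\w+)(\\.\\w+)*");

    // True only when the whole candidate string fits the pattern.
    static boolean isWholeMatch(String candidate) {
        return EMAIL.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        String[] candidates = {
            "fred@vianet.com",
            "barney@comcorp.net.au",
            "wilma@mjinteractive.iinet.net.au",
            "billy@webworld."   // incomplete: nothing follows the period
        };
        for (String candidate : candidates) {
            System.out.println(candidate + " -> " + isWholeMatch(candidate));
        }
    }
}
```

The first three candidates print `true` and the last prints `false`, because the third group needs at least one word character after the period.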
With our regular expression crafted, it’s time to move on to the Java side of things. The very first thing you will need to know is how to combat the rather unfortunate syntax clash between Java strings and regular expressions. It’s a clash that you, the developer, must deal with.
Java Safe Regular Expressions
It’s slightly annoying, but the fact remains that you will need to make your regular expressions safe for use in Java code. This means that every backslash in the pattern needs to be escaped with a second backslash. This is because the backslash character has its own special meaning in Java string literals. So, our example email address regex would have to be rewritten as follows:
String emailRegEx = "(\\w+)@(\\w+\\.)(\\w+)(\\.\\w+)*";
Keep in mind that if you actually need to match against a literal backslash, you must double up yet again. It can be more difficult to read a Java safe regex, so you may want first to craft a ‘regular’ regular expression (a regregex perhaps?) and keep a copy handy — perhaps inside a code comment.
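The doubling rule can be sketched as follows. The class name `BackslashEscapes` and the method `containsBackslash` are made up for this example; the point is only that the regex for one literal backslash (`\\`) becomes four backslashes once it is written as a Java string literal.

```java
import java.util.regex.Pattern;

class BackslashEscapes {

    // Regex for one literal backslash is \\ ; as a Java literal that is "\\\\".
    static boolean containsBackslash(String text) {
        return Pattern.compile("\\\\").matcher(text).find();
    }

    public static void main(String[] args) {
        // The Java literal "C:\\temp\\notes.txt" holds the characters C:\temp\notes.txt
        String windowsPath = "C:\\temp\\notes.txt";

        System.out.println(containsBackslash(windowsPath));         // prints true
        System.out.println(containsBackslash("C:/temp/notes.txt")); // prints false

        // The Java literal "\\w+" is the three-character regex \w+
        System.out.println("\\w+".length());                        // prints 3
    }
}
```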
So, how do we use all this to achieve something useful? In certain situations, you can simply call methods such as `replaceFirst()` and `replaceAll()` directly on the `String` class; we’ll take a quick look at this approach later. However, for more sophisticated regex operations, you will be far better served by taking a more object oriented approach.
The Pattern Class
Here’s something refreshing: the java.util.regex package only contains three classes — and one of those is an exception! As you would expect, this makes for a very easy-to-learn API. Here are the 3 steps you would generally follow to use the regex package:
- Compile your regex string using the Pattern class.
- Use the Pattern class to get a Matcher object.
- Call methods on the Matcher to get at any matches.
We will look at the Matcher class next, but let’s dive in with a look at the Pattern class. This class lets you compile your regular expression — this effectively optimises it for efficiency and use by multiple target strings (strings which you want to test the compiled regular expression against). Consider the following example:
String emailRegEx = "(\\w+)@(\\w+\\.)(\\w+)(\\.\\w+)*";
// Compile and get a reference to a Pattern object.
Pattern pattern = Pattern.compile(emailRegEx);
// Get a matcher object for a target string - we cover this next.
Matcher matcher = pattern.matcher("billy@webworld.com");
Take note that the Pattern object was retrieved via the Pattern class’s static `compile()` method; you cannot instantiate a Pattern object using `new`. Once you have a Pattern object you can use it to get a reference to a Matcher object. We look at Matcher next.
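The benefit of compiling once can be sketched like this: one `Pattern` held in a constant and reused across several target strings. The names `PatternReuse`, `countContaining` and `isValidRegex` are hypothetical. The second helper relies on the package's one exception class, `PatternSyntaxException`, which `Pattern.compile()` throws for a malformed expression.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class PatternReuse {

    // Compiled once, then shared by every call below.
    private static final Pattern EMAIL =
            Pattern.compile("(\\w+)@(\\w+\\.)(\\w+)(\\.\\w+)*");

    // Counts how many lines contain at least one email-like substring.
    static int countContaining(String[] lines) {
        int count = 0;
        for (String line : lines) {
            if (EMAIL.matcher(line).find()) {
                count++;
            }
        }
        return count;
    }

    // Reports whether a regex string compiles at all.
    static boolean isValidRegex(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] lines = {
            "write to joe@optus.net",
            "no address here",
            "billy@webworld.com"
        };
        System.out.println(countContaining(lines));  // prints 2
        System.out.println(isValidRegex("(\\w+"));   // prints false (unclosed group)
    }
}
```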
The Matcher Class
Earlier, I suggested that regular expressions are a kind of SQL query for free flowing text. The analogy is not entirely perfect, but when using the regex API it can help to think along these lines. If you think of `Pattern.compile(myRegEx)` as being a kind of JDBC PreparedStatement, then you can think of the Pattern class’s `matcher(targetString)` method as a kind of SQL SELECT statement. Study the following code:
// Compile the regex.
String regex = "(\\w+)@(\\w+\\.)(\\w+)(\\.\\w+)*";
Pattern pattern = Pattern.compile(regex);
// Create the 'target' string we wish to interrogate.
String targetString = "You can email me at g_andy@example.com or andy@example.net to get more info";
// Get a Matcher based on the target string.
Matcher matcher = pattern.matcher(targetString);
// Find all the matches.
while (matcher.find()) {
    System.out.println("Found a match: " + matcher.group());
    System.out.println("Start position: " + matcher.start());
    System.out.println("End position: " + matcher.end());
}
There are a few interesting things going on here. First up, notice that we used the Pattern class’s `matcher()` method to obtain a Matcher object. This object, still using our SQL analogy, is where the resulting matches are held; think JDBC ResultSet. The records, of course, are the portions of text that matched our regular expression.
The `while` loop runs conditionally based on the results of the Matcher class’s `find()` method. This method will parse just enough of our target string to make a match, at which point it will return true. Be careful: any attempt to ask the matcher for match details (with `group()`, `start()` or `end()`, for example) before a successful call to `find()` will result in the unchecked `IllegalStateException` being thrown at runtime.
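A small sketch of that pitfall and one defensive way around it. `MatchStateDemo`, `groupBeforeFindThrows` and `firstMatchOrNull` are hypothetical names, and returning `null` for "no match" is just one possible convention.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchStateDemo {

    // Demonstrates the failure: group() is called before any match attempt.
    static boolean groupBeforeFindThrows() {
        Matcher matcher = Pattern.compile("\\w+").matcher("hello");
        try {
            matcher.group();
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    // Safe variant: only touch group() after find() has returned true.
    static String firstMatchOrNull(Pattern pattern, String text) {
        Matcher matcher = pattern.matcher(text);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(groupBeforeFindThrows());                             // prints true
        System.out.println(firstMatchOrNull(Pattern.compile("\\d+"), "room 42")); // prints 42
        System.out.println(firstMatchOrNull(Pattern.compile("\\d+"), "no digits")); // prints null
    }
}
```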
In the body of our while loop we retrieved the matched substring using the Matcher class’s `group()` method. Our while loop executes twice: once for each email address in our target string. On each occasion, it prints the matched email address, returned by the `group()` method, and the substring location information. Take a look at the output:
Found a match: g_andy@example.com
Start position: 20
End position: 38
Found a match: andy@example.net
Start position: 42
End position: 58
As you can see, it was simply a matter of using the Matcher’s `start()` and `end()` methods to find out where the matched substrings occurred in the target string. Next up, a closer look at the `group()` method.
Understanding Groups
As you learned, `Matcher.group()` will retrieve a complete match from the target string. But what if you were also interested in subsections, or ‘subgroups’ of the matched text? In our email example, it may have been desirable to extract the host name portion of the email address and the username portion. Have a look at a revised version of our Matcher driven while loop:
while (matcher.find()) {
    System.out.println("Found a match: " + matcher.group(0) +
            ". The Username is " +
            matcher.group(1) + " and the ISP is " +
            matcher.group(2));
}
As you may recall, groups are represented as a set of parentheses wrapped around a subsection of your pattern. The first group, located using `Matcher.group()` or, as in the example, the more specific `Matcher.group(0)`, represents the entire match. Further groups can be found using the same `group(int index)` method. Here is the output for the above example:
Found a match: g_andy@example.com. The Username is g_andy and the ISP is example.
Found a match: andy@example.net. The Username is andy and the ISP is example.
As you can see, `group(1)` retrieves the username portion of the email address and `group(2)` retrieves the ISP portion. When crafting your own regular expressions it is, of course, up to you how you logically subgroup your patterns. A minor oversight in this example is that the period itself is captured as part of the subgroup returned by `group(2)`!
Keep in mind that subgroups are indexed from left to right based on the order of their opening parentheses. This is particularly important when you are working with groups that are nested within other groups.
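The numbering rule is easiest to see with a nested group. The pattern below is a hypothetical variant of the email regex, written only for this illustration, in which the whole host name is wrapped in an outer group. `NestedGroupsDemo` and `groupsOf` are likewise assumed names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NestedGroupsDemo {

    // Opening parentheses, left to right:
    //   1: (\w+)              the username
    //   2: ((\w+)\.(\w+))     the whole host name (outer group)
    //   3: (\w+)              first part of the host name (nested)
    //   4: (\w+)              second part of the host name (nested)
    static final Pattern NESTED =
            Pattern.compile("(\\w+)@((\\w+)\\.(\\w+))");

    // Returns group 0..groupCount() for a full match, or an empty array.
    static String[] groupsOf(String address) {
        Matcher matcher = NESTED.matcher(address);
        if (!matcher.matches()) {
            return new String[0];
        }
        String[] groups = new String[matcher.groupCount() + 1];
        for (int i = 0; i <= matcher.groupCount(); i++) {
            groups[i] = matcher.group(i);
        }
        return groups;
    }

    public static void main(String[] args) {
        String[] groups = groupsOf("andy@example.net");
        for (int i = 0; i < groups.length; i++) {
            System.out.println("group(" + i + ") = " + groups[i]);
        }
    }
}
```

For `andy@example.net` this prints the full address for group 0, then `andy`, `example.net`, `example` and `net` for groups 1 to 4.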
A Little More on the Pattern and Matcher Classes
That’s pretty much the core of this very small, yet very capable, Java API. However, there are a few other bits and pieces you should look into once you’ve had a chance to experiment with the basics. The Pattern class has a number of flags that you can use as a second argument to its `compile()` method. For example, you can use `Pattern.CASE_INSENSITIVE` to tell the regex engine to match ASCII characters regardless of case.
`Pattern.MULTILINE` is another useful one. You will sometimes want to tell the regex engine that your target string is not a single line of text; rather, it contains several lines that have their own termination characters. With this flag set, `^` and `$` match at the start and end of each line rather than only at the start and end of the whole target string.
If you need to, you can combine multiple flags by using the Java `|` (vertical bar, bitwise OR) operator. For instance, if you wanted to compile a regex with multiline and case insensitivity support, you could do the following:
Pattern.compile(myRegEx, Pattern.CASE_INSENSITIVE | Pattern.MULTILINE );
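Here is a rough sketch of what the combined flags change. The three-line target string, the pattern `^will$` and the names `FlagsDemo` and `countMatches` are all assumptions made for this example. Passing `0` as the flags argument means "no flags".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FlagsDemo {

    // Counts the matches of 'regex' in 'text' when compiled with 'flags'.
    static int countMatches(String regex, int flags, String text) {
        Matcher matcher = Pattern.compile(regex, flags).matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "Will\nwilliam\nWILL";

        // No flags: ^ and $ anchor to the whole string, and case matters.
        System.out.println(countMatches("^will$", 0, text)); // prints 0

        // Both flags: each line is tested on its own, ignoring case.
        int flags = Pattern.CASE_INSENSITIVE | Pattern.MULTILINE;
        System.out.println(countMatches("^will$", flags, text)); // prints 2
    }
}
```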
The Matcher class has a number of interesting methods, too: `String replaceAll(String replacementString)` and `String replaceFirst(String replacementString)`, in particular, are worth a mention here.
The `replaceAll()` method takes a replacement string and replaces all matches with it. The `replaceFirst()` method is very similar but will (you guessed it) replace only the first occurrence of a match. Have a look at the following code:
// Matches 'BBC' words that end with a digit.
String thePattern = "bbc\\d";
// Compile regex and switch off case sensitivity.
Pattern pattern = Pattern.compile(thePattern, Pattern.CASE_INSENSITIVE);
// The target string.
String target = "I like to watch bBC1 and BbC2 - I suppose ITV is okay too";
// Get the Matcher for the target string.
Matcher matcher = pattern.matcher(target);
// Blot out all references to the BBC.
System.out.println(matcher.replaceAll("xxxx") );
Here’s the output:
I like to watch xxxx and xxxx - I suppose ITV is okay too
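For comparison, the same pattern and target string with `replaceFirst()` might look like the sketch below; only the first channel name is blotted out. `ReplaceFirstDemo` and `blotFirst` are hypothetical names.

```java
import java.util.regex.Pattern;

class ReplaceFirstDemo {

    // Same pattern as above: 'BBC' in any case, followed by a digit.
    static final Pattern BBC =
            Pattern.compile("bbc\\d", Pattern.CASE_INSENSITIVE);

    // Replaces only the first match in the target string.
    static String blotFirst(String target) {
        return BBC.matcher(target).replaceFirst("xxxx");
    }

    public static void main(String[] args) {
        String target = "I like to watch bBC1 and BbC2 - I suppose ITV is okay too";
        // prints: I like to watch xxxx and BbC2 - I suppose ITV is okay too
        System.out.println(blotFirst(target));
    }
}
```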
Backreferences
It’s worth taking a quick look at another important regex topic: backreferences. Backreferences allow you to access captured subgroups while the regex engine is executing. Basically, this means that you can refer to a subgroup from an earlier part of a match later on in the pattern. Imagine that you needed to inspect a target string for 3-letter words that started and ended with the same letter — wow, sos, mum, that kind of thing. Here’s a pattern that will do the job:
(\w)(\w)(\1)
In this case, the `(\1)` group contains a backreference to the first match made in the pattern. Basically, the third parenthesised group will only match when the character at this position is the same as the character in the first parenthesised group. Of course, you would simply substitute `\1` with `\2` if you wanted to backreference the second group. It’s simple, but in many cases, tremendously useful.
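A sketch of the backreference at work. Note one assumption that goes beyond the pattern above: `\b` word boundaries are added on both sides so that only standalone 3-letter words are reported, rather than 3-character runs inside longer words. `BackreferenceDemo` and `findThreeLetterWords` are made-up names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackreferenceDemo {

    // The article's (\w)(\w)(\1), wrapped in \b...\b (an addition for this sketch).
    static final Pattern SAME_ENDS =
            Pattern.compile("\\b(\\w)(\\w)(\\1)\\b");

    // Lists the 3-letter words whose first and last characters are equal.
    static List<String> findThreeLetterWords(String text) {
        List<String> words = new ArrayList<String>();
        Matcher matcher = SAME_ENDS.matcher(text);
        while (matcher.find()) {
            words.add(matcher.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // prints [wow, sos, mum]
        System.out.println(findThreeLetterWords("wow, that sos call from mum was odd"));
    }
}
```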
The Matcher object’s replacement methods (and the String class’s counterparts) also support a notation for doing backreferences in the replacement string. It works in the same way, but uses a dollar sign instead of a backslash. So, `matcher.replaceAll("$2")` would replace all matches in a target string with the value matched by the second subgroup of the regular expression.
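As an illustration of the dollar-sign notation, the sketch below reuses the article's email pattern and rewrites each address with the word "at" in place of the `@` sign. The replacement string `"$1 at $2$3"` and the names `GroupReplacementDemo` and `spellOut` are assumptions for this example. It only reassembles groups 1 to 3, so it suits addresses with a two-part host name.

```java
import java.util.regex.Pattern;

class GroupReplacementDemo {

    static final Pattern EMAIL =
            Pattern.compile("(\\w+)@(\\w+\\.)(\\w+)(\\.\\w+)*");

    // $1 is the username, $2 the first host part (with its period), $3 the next part.
    static String spellOut(String text) {
        return EMAIL.matcher(text).replaceAll("$1 at $2$3");
    }

    public static void main(String[] args) {
        // prints: Mail g_andy at example.com or andy at example.net
        System.out.println(spellOut("Mail g_andy@example.com or andy@example.net"));
    }
}
```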
String Class RegEx Methods
As I mentioned earlier, the Java String class has been updated to take advantage of regular expressions. You can, in simple cases, completely bypass using the regex API directly by calling regex enabled methods directly on the String class. There are 5 such methods available.
You can use the `boolean matches(String regex)` method to quickly determine if a string exactly matches a particular pattern. The appropriately named `String replaceFirst(String regex, String replacement)` and `String replaceAll(String regex, String replacement)` methods allow you to do quick and dirty text replacements. And finally, the `String[] split(String regEx)` and `String[] split(String regEx, int limit)` methods let you split a string into substrings based on a regular expression. These last two methods are, in concept, similar to the `java.util.StringTokenizer`, only much more powerful.
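The five String methods can be tried out with a sketch along these lines. The sample strings and the names `StringRegexMethods`, `isIsoDate` and `splitFields` are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

class StringRegexMethods {

    // matches(): the whole string must fit the pattern.
    static boolean isIsoDate(String text) {
        return text.matches("\\d{4}-\\d{2}-\\d{2}");
    }

    // split(): comma followed by optional whitespace is the delimiter.
    // A limit of 0 behaves like the one-argument split().
    static List<String> splitFields(String line, int limit) {
        return Arrays.asList(line.split(",\\s*", limit));
    }

    public static void main(String[] args) {
        System.out.println(isIsoDate("2019-01-01"));                  // prints true
        System.out.println(isIsoDate("1 Jan 2019"));                  // prints false

        System.out.println("a1b22c333".replaceAll("\\d+", "#"));      // prints a#b#c#
        System.out.println("a1b22c333".replaceFirst("\\d+", "#"));    // prints a#b22c333

        System.out.println(splitFields("one, two,three", 0));         // prints [one, two, three]
        System.out.println(splitFields("one, two,three", 2));         // prints [one, two,three]
    }
}
```

With a limit of 2 the delimiter is applied only once, so the second element keeps its comma.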
Keep in mind that it makes much more sense, in many cases, to use the regex API and a more object oriented approach. One reason for this is that such an approach allows you to precompile your regular expression and then use it across multiple target strings. Another reason is that it is simply much more capable. You will quickly get the hang of when to choose one approach over the other.
Hopefully, I have given you a head start with the regex API and tempted those who are yet to discover this powerful tool to give it some serious consideration. A quick tip: don’t waste hours of precious development time trying to craft a complicated regular expression — it may already exist. There are plenty of places, such as www.regexlib.com, that make a whole bunch of them freely available.
Frequently Asked Questions about Java Regex API
What is the basic syntax for Java Regex API?
The Java Regex API is a powerful tool for manipulating strings. The basic syntax involves creating a Pattern object from a regular expression string, and then creating a Matcher object from the Pattern object and the string you want to search. Here’s a simple example:
Pattern pattern = Pattern.compile("regex");
Matcher matcher = pattern.matcher("string to search");
In this example, “regex” is the regular expression you want to search for, and “string to search” is the string you’re searching in.
How can I use Java Regex API for string matching?
The Matcher object provides several methods for string matching. The `matches()` method returns true if the entire string matches the regular expression. The `find()` method returns true if a substring matches the regular expression. The `group()` method returns the matched substring. Note that `find()` resumes where the previous match ended, so call `reset()` before searching again from the start. Here’s an example:
Pattern pattern = Pattern.compile("a*b");
Matcher matcher = pattern.matcher("aaab");
boolean matches = matcher.matches(); // returns true
matcher.reset(); // rewind; otherwise find() would resume after the match above
boolean found = matcher.find(); // returns true
String group = matcher.group(); // returns "aaab"
What are some common special characters in Java Regex API?
Java Regex API uses several special characters, also known as metacharacters, to define regular expressions. Some common ones include:
- `.`: Matches any single character except newline.
- `*`: Matches zero or more occurrences of the preceding character or group.
- `+`: Matches one or more occurrences of the preceding character or group.
- `?`: Matches zero or one occurrence of the preceding character or group.
- `[]`: Defines a character class, matching any single character within the brackets.
- `()`: Defines a group.
- `^`: Matches the start of a line.
- `$`: Matches the end of a line.
How can I use Java Regex API to replace strings?
The Matcher object provides the `replaceAll()` and `replaceFirst()` methods for replacing strings. The `replaceAll()` method replaces all occurrences of the regular expression with a replacement string, while the `replaceFirst()` method replaces only the first occurrence. Here’s an example:
Pattern pattern = Pattern.compile("dog");
Matcher matcher = pattern.matcher("The dog is a dog.");
String replaced = matcher.replaceAll("cat"); // returns "The cat is a cat."
How can I use Java Regex API to split strings?
The Pattern object provides the `split()` method for splitting strings. This method splits the string around matches of the regular expression. Here’s an example:
Pattern pattern = Pattern.compile("\\s+");
String[] words = pattern.split("One two three"); // returns ["One", "two", "three"]
In this example, “\s+” is a regular expression that matches one or more whitespace characters.
What is the difference between greedy, reluctant, and possessive quantifiers in Java Regex API?
Greedy, reluctant, and possessive are types of quantifiers in Java Regex API. A greedy quantifier matches as many occurrences as possible, a reluctant quantifier matches as few occurrences as possible, and a possessive quantifier also matches as many occurrences as possible but never backtracks to give characters back, even if that makes the overall match fail. Here’s an example that illustrates the difference:
Pattern pattern = Pattern.compile("a*");
Matcher matcher = pattern.matcher("aaa");
matcher.find(); // true; matcher.group() returns "aaa" (greedy)
pattern = Pattern.compile("a*?");
matcher = pattern.matcher("aaa");
matcher.find(); // true; matcher.group() returns "" (reluctant)
pattern = Pattern.compile("a*+");
matcher = pattern.matcher("aaa");
matcher.find(); // true; matcher.group() returns "aaa" (possessive)
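The difference between greedy and possessive only shows up when something must follow the quantified part. A minimal sketch, assuming the patterns `a*a`, `a*?a` and `a*+a` as test cases (the names `QuantifierBacktracking` and `wholeMatch` are hypothetical):

```java
import java.util.regex.Pattern;

class QuantifierBacktracking {

    // True when 'input' as a whole matches 'regex'.
    static boolean wholeMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Greedy: a* first takes all three, then gives one back for the final 'a'.
        System.out.println(wholeMatch("a*a", "aaa"));   // prints true

        // Reluctant: a*? starts with nothing and grows until the final 'a' fits.
        System.out.println(wholeMatch("a*?a", "aaa"));  // prints true

        // Possessive: a*+ takes all three and refuses to give any back,
        // so nothing is left for the final 'a'.
        System.out.println(wholeMatch("a*+a", "aaa"));  // prints false
    }
}
```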
How can I use Java Regex API to validate user input?
You can use the `matches()` method of the Matcher object to validate user input. For example, you can check if an email address is valid with the following code:
Pattern pattern = Pattern.compile("^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,6}$");
Matcher matcher = pattern.matcher("user@example.com");
boolean isValid = matcher.matches(); // returns true if the email is valid
How can I use Java Regex API to extract information from strings?
You can use the `group()` method of the Matcher object to extract information from strings. For example, you can extract the domain from an email address with the following code:
Pattern pattern = Pattern.compile("@(.+)");
Matcher matcher = pattern.matcher("user@example.com");
if (matcher.find()) {
    String domain = matcher.group(1); // returns "example.com"
}
How can I use Java Regex API to search and replace strings with dynamic content?
You can use the `appendReplacement()` and `appendTail()` methods of the Matcher object to search and replace strings with dynamic content. Here’s an example:
Pattern pattern = Pattern.compile("dog");
Matcher matcher = pattern.matcher("The dog is a dog.");
StringBuffer sb = new StringBuffer();
while (matcher.find()) {
    matcher.appendReplacement(sb, "cat");
}
matcher.appendTail(sb);
String replaced = sb.toString(); // returns "The cat is a cat."
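The replacement in that example is a fixed string. To make it genuinely dynamic, the replacement can be computed from each match. The sketch below fills `${name}` style placeholders from a map; the placeholder syntax and the names `TemplateFill` and `fill` are assumptions for illustration. `Matcher.quoteReplacement()` is used so that any `$` or backslash in the computed value is treated literally rather than as a group reference.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TemplateFill {

    // Matches ${word} and captures the word as group 1.
    private static final Pattern PLACEHOLDER =
            Pattern.compile("\\$\\{(\\w+)\\}");

    // Replaces each known placeholder with its value; unknown ones are left as-is.
    static String fill(String template, Map<String, String> values) {
        Matcher matcher = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String value = values.get(matcher.group(1));
            if (value == null) {
                value = matcher.group(); // keep the original placeholder text
            }
            matcher.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> values = new HashMap<String, String>();
        values.put("name", "Andy");
        // prints: Hello Andy, see ${place}
        System.out.println(fill("Hello ${name}, see ${place}", values));
    }
}
```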
How can I use Java Regex API to parse log files?
You can use the `find()` and `group()` methods of the Matcher object to parse log files. For example, you can extract the date and time from a log entry with the following code:
Pattern pattern = Pattern.compile("(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2})");
Matcher matcher = pattern.matcher("2019-01-01 12:34:56 INFO Starting application");
if (matcher.find()) {
    String dateTime = matcher.group(1); // returns "2019-01-01 12:34:56"
}
Andy is an independent Java and ColdFusion programmer who lives in Perth, Western Australia. He is also a Macromedia Certified Instructor for Desktop Applications, one of Perth's largest providers of Macromedia-based training.