Let’s get this out there right off the bat: I love regular expressions. Really, I do: they’re the Swiss Army knife of text processing, and no self-respecting developer can go long without needing ‘em.
Of course, we all also know how dangerous they can be. As always, with great power comes great responsibility.
Still, if you know how — and when — and why — to use regular expressions, they’re indispensable. So this week, regular expressions will be our theme.
Below are five regular expressions. Each one of them matches a real-world string; that is, a semi-structured piece of text you might want to pull out of a greater document. Here’s an example question to give you an idea what I mean:
[0-9]{5}
This, of course, is a US ZIP code.
So, what “things” do these regular expressions match? We’ll assume for this quiz that the regex engine is running in case-insensitive mode:
1. [A-PR-Y0-9]{3}-[A-PR-Y0-9]{3}-[A-PR-Y0-9]{4}
2. &(?!(\w+|#\d+);)
3. (-?(?:0|[1-9]\d*))(\.\d+)?([eE][-+]?\d+)?
4. ([\da-f]{2}:){5}([\da-f]{2})
5. <[^>]*?>
Of course, since we’re dealing with regular expressions here, I’d be remiss if I didn’t give you two problems for the price of one.
In each case, the regular expression has something wrong with it. For example, the ZIP code regex above doesn’t correctly match the ZIP+4 format (e.g. 66044-0034) that’s used for many addresses these days.
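To make the ZIP example concrete, here is a minimal Python sketch. Python itself, the names ZIP_ONLY, ZIP_PLUS4 and first_match, and the sample text are all my own choices for illustration, and the "fixed" pattern is just one possible repair, not the only one. It shows the original pattern stopping after five digits, and a variant that treats the +4 suffix as optional:

```python
import re

# Original quiz pattern: exactly five digits, nothing more.
ZIP_ONLY = re.compile(r"[0-9]{5}")

# One possible fix (an assumption, not the official answer):
# allow an optional "-NNNN" suffix for ZIP+4.
ZIP_PLUS4 = re.compile(r"[0-9]{5}(?:-[0-9]{4})?")

def first_match(pattern, text):
    """Return the first substring the pattern matches, or None."""
    m = pattern.search(text)
    return m.group(0) if m else None

sample = "Ship to 66044-0034"
print(first_match(ZIP_ONLY, sample))   # 66044
print(first_match(ZIP_PLUS4, sample))  # 66044-0034
```

Neither pattern checks that the digits stand alone, so both would also happily match inside a longer run of digits.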
So, for part two, what’s wrong with the rest of ‘em?
Enjoy your Thanksgiving belly-stuffing, and tune in over the weekend for the answers.
1. I’m not from the States so am unsure, but is it a social security number? If so, then the second sequence should only contain a repetition of 2.
[A-PR-Y0-9]{3}-[A-PR-Y0-9]{2}-[A-PR-Y0-9]{4}
2. Not a clue about this one! An optional ampersand, followed by an exclamation mark, followed by either one or more words (alphanumerics) or a hash followed by one or more digits, terminated with a semi-colon.
3. I think this is a number expressed in scientific notation, e.g. 1.3e10 – However, I don’t think the \d is necessary in “[1-9]\d*”
4. This looks like a MAC address. Six hex numbers separated by colons. However, I can’t see anything wrong with it so I’m probably wrong!
5. This looks like it would match an SGML tag of some sort, although it’s not very specific. It will match anything starting with a ” followed by a closing ‘>’.
November 22nd, 2006 at 3:42 am
Edit… 5. should read “… anything starting with a ‘<’ followed by a closing ‘>’.”
November 22nd, 2006 at 3:45 am
Number 2 appears to be matching HTML character codes (e.g. &copy;). The ! makes this a negative lookahead assertion, which should be removed to have it work correctly.
November 22nd, 2006 at 5:13 am
1. Telephone numbers, including those represented as letters.
2. HTML character entity encodings in hexadecimal form.
3. Numbers in scientific/engineering notation.
4. IPv6 IP addresses
5. HTML/XML/XHTML Tags. Any markup, essentially, where the tags use open/closing angle brackets.
November 22nd, 2006 at 5:17 am
This is a fun way to learn a bit more about regex.
November 22nd, 2006 at 6:47 am
I agree with Larry, except I think #4 is a MAC address, not an IP address (like mmanders suggested)
As for the “errors”:
1. US phone number — doesn’t account for a preceding 1, if the area code is in parentheses, if the digit groups are separated by a dot or space instead of a dash, or the fact that cell phones have Q and Z on them. It also doesn’t make sure the group is isolated, and not part of something like 1234888-234-123456123.
2. HTML character and entity references — should be a positive look-ahead, not a negative one, and instead of a word character it should be a-z since entity references don’t have ‘_’
3. numbers in exponential notation (with ‘1.2354 e10’ style exponent) — I believe that exponential notation only has one digit before the decimal, so the \d* should be dropped and a negative lookbehind added to ensure a single digit. Also, there could be a space between the digits and the exponent.
4. MAC address — doesn’t allow for digit groups to be delineated by hyphens. Because : counts as a non-word character it’s not as easy as putting a word boundary on either side.
There’s also a type of MAC address format like 0123.4567.89ab, so you could more precisely do this:
5. XML style markup tags — shouldn’t have an asterisk for the contents, since that could also match an empty <>. And having the non-greedy * followed by a >, and a match for any character but > were accomplishing the same thing so I dropped one.
Note: I’m assuming these will be processed as case-insensitive, or else there’s a whole new set of problems… There are a million different ways to do regex, so these are just my suggestions — I’m sure there are better ways.
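For what it’s worth, the change described in point 5 can be sketched in Python like this. The names TAG_ORIGINAL and TAG_FIXED and the sample string are made up for illustration, and swapping the lazy `*?` for a plain `+` is only one reading of “drop the asterisk”:

```python
import re

# Quiz pattern: zero or more non-">" characters, matched lazily.
TAG_ORIGINAL = re.compile(r"<[^>]*?>")

# Sketch of the suggested change: require at least one character
# between the brackets. The lazy "?" is dropped because [^>] already
# stops the match at the first ">".
TAG_FIXED = re.compile(r"<[^>]+>")

sample = "<p>a <> b</p>"
print(TAG_ORIGINAL.findall(sample))  # ['<p>', '<>', '</p>']
print(TAG_FIXED.findall(sample))     # ['<p>', '</p>']
```

Both versions still stop at the first ‘>’, so an attribute value that contains a ‘>’ would cut the match short.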
This was fun!
November 22nd, 2006 at 8:44 am
Hi birnam,
I agree with your disagreement with me on item 4. :) I jumped the gun on the number of groups. Good job on the explanations/corrections, too.
November 22nd, 2006 at 11:54 am
bah. Another US specific one :(
November 22nd, 2006 at 10:47 pm
Number 2 searches for any occurrence of an ampersand (&) that does NOT appear to be the beginning of a named or numeric entity.
It may be useful if you need to find what appear to be unescaped ampersands in a string.
“This & that” would match
“This &amp; that” would not match
It doesn’t take into account the validity of the entity reference, and doesn’t account for numeric character entities in hexadecimal form.
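Here is a rough Python sketch of that behaviour. The names BARE_AMP, BARE_AMP_HEX and has_bare_amp are hypothetical, and the hex-aware variant is only one way the pattern might be extended, not an official answer:

```python
import re

# Quiz pattern: "&" that is NOT followed by "name;" or "#digits;".
BARE_AMP = re.compile(r"&(?!(\w+|#\d+);)", re.IGNORECASE)

# Hypothetical extension: also accept hex references such as "&#x26;".
BARE_AMP_HEX = re.compile(r"&(?!(?:\w+|#\d+|#x[0-9a-f]+);)", re.IGNORECASE)

def has_bare_amp(pattern, text):
    """True if the text contains an ampersand the pattern flags as unescaped."""
    return pattern.search(text) is not None

print(has_bare_amp(BARE_AMP, "This & that"))           # True
print(has_bare_amp(BARE_AMP, "This &amp; that"))       # False
print(has_bare_amp(BARE_AMP, "This &#38; that"))       # False
print(has_bare_amp(BARE_AMP, "This &#x26; that"))      # True: hex form is flagged by mistake
print(has_bare_amp(BARE_AMP_HEX, "This &#x26; that"))  # False
```

Neither version checks that a named entity actually exists, so something like “&bogus;” is still treated as escaped.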
November 23rd, 2006 at 9:11 am
Nice collection!
Here is a good example that shows you how to implement regular expressions with .NET for U.S. Social Security numbers.
^((?!000)([0-6]\d{2}|[0-7]{2}[0-2]))-((?!00)\d{2})-((?!0000)\d{4})$
Also get complete code for ASP.NET, VB.NET and C#.NET
Please check:- http://www.tipsntracks.com/98/regular-expressions-with-net-us-social-security-numbers.html
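As a quick sanity check, the pattern above can be tried out in Python rather than .NET. This is only a sketch: the name is_ssn_shaped and the sample numbers are invented for illustration, and the pattern tests format only, not whether a number was ever issued:

```python
import re

# The pattern from the comment, unchanged. The negative lookaheads
# reject an all-zero area (000), group (00), or serial (0000).
SSN = re.compile(
    r"^((?!000)([0-6]\d{2}|[0-7]{2}[0-2]))-((?!00)\d{2})-((?!0000)\d{4})$"
)

def is_ssn_shaped(text):
    """True if the whole string has the NNN-NN-NNNN shape the pattern allows."""
    return SSN.match(text) is not None

# Made-up sample values, not real numbers.
for candidate in ["123-45-6789", "000-45-6789", "123-00-6789", "123-45-0000"]:
    print(candidate, is_ssn_shaped(candidate))
```

Only the first sample passes; the other three each trip one of the lookaheads.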
April 17th, 2009 at 6:17 pm