Episode 2: Real-world regular expressions

Jacob Kaplan-Moss

Let’s get this out there right off the bat: I love regular expressions. Really, I do. They’re the Swiss Army knife of text processing, and no self-respecting developer can go long without needing ’em.

Of course, we all also know how dangerous they can be. As always, with great power comes great responsibility.

Still, if you know how — and when — and why — to use regular expressions, they’re indispensable. So this week, regular expressions will be our theme.

Below are five regular expressions. Each one of them matches a real-world string; that is, a semi-structured piece of text you might want to pull out of a larger document. Here’s an example question to give you an idea of what I mean:

  1. [0-9]{5}

This, of course, is a US ZIP code.

So, what “things” do these regular expressions match? We’ll assume for this quiz that the regex engine is running in case-insensitive mode:

  1. [A-PR-Y0-9]{3}-[A-PR-Y0-9]{3}-[A-PR-Y0-9]{4}
  2. &(?!(\w+|#\d+);)
  3. (-?(?:0|[1-9]\d*))(\.\d+)?([eE][-+]?\d+)?
  4. ([\da-f]{2}:){5}([\da-f]{2})
  5. <[^>]*?>
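
If you’d rather poke at these patterns than just stare at them, something like the following should do. It’s a minimal sketch that assumes Python’s `re` module; the helper name `find_all` and the sample strings are made up for illustration and aren’t part of the quiz.

```python
import re

def find_all(pattern, text):
    """Return every substring of text that the pattern matches, ignoring case."""
    compiled = re.compile(pattern, re.IGNORECASE)
    return [match.group(0) for match in compiled.finditer(text)]

# The warm-up pattern from the example question.
print(find_all(r"[0-9]{5}", "ZIP 66044"))   # ['66044']

# Case-insensitive mode means a lowercase class also matches uppercase text.
print(find_all(r"[a-f]{2}", "AB cd ef"))    # ['AB', 'cd', 'ef']
```

Feed it one of the five patterns plus a chunk of text you think it ought to match, and see what comes back.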

Of course, since we’re dealing with regular expressions here, I’d be remiss if I didn’t give you two problems for the price of one.

In each case, the regular expression has something wrong with it. For example, the ZIP code regex above doesn’t correctly match the ZIP+4 format (e.g. 66044-0034) that’s used for many addresses these days.
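
Here’s that ZIP problem in action. This is a quick sketch assuming Python’s `re` module; the sample string is invented, and the optional `-NNNN` group in the second pattern is just one possible fix, not the official answer.

```python
import re

# The five-digit pattern from the example question.
zip5 = re.compile(r"[0-9]{5}")

# One possible extension (an assumption, not the canonical fix):
# allow an optional hyphen plus four more digits.
zip_plus4 = re.compile(r"[0-9]{5}(?:-[0-9]{4})?")

text = "Mail to ZIP 66044-0034"

print(zip5.search(text).group(0))       # 66044
print(zip_plus4.search(text).group(0))  # 66044-0034
```

The first pattern does find something, but it silently drops the last four digits. That’s the kind of flaw to look for in part two.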

So, for part two, what’s wrong with the rest of ’em?

Enjoy your Thanksgiving belly-stuffing, and tune in over the weekend for the answers.