Answers to Episode 2 (Real-life regular expressions)

Jacob Kaplan-Moss

Yeah, I’m a little late getting these answers posted. Sorry!

If you missed it, last week’s challenge dealt with deciphering regular expressions and finding subtle bugs within ’em.

As with last week, before getting to the actual answers please indulge me while I pontificate a bit:

Hopefully it’s pretty obvious that regular expressions are a double-edged sword. Sure, deciphering them makes a fun quiz, but imagine running across these monsters in code and trying to figure out what they do… not fun.

Fortunately, nearly every regex implementation has a “verbose” mode that allows you to embed comments inside regular expressions (in most languages this is the x flag). For the sake of those who must read your code, please use the verbose mode!
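In case you haven’t seen verbose mode in action, here’s a minimal sketch in Python, where the flag is spelled re.VERBOSE (or re.X). The date pattern and the DATE_RE name are made up purely for illustration; they aren’t from the quiz.

```python
import re

# Hypothetical example pattern (not one of the quiz questions).
# In verbose mode, whitespace outside character classes is ignored
# and everything after an unescaped "#" is a comment.
DATE_RE = re.compile(r"""
    (\d{4})   # four-digit year
    -         # literal dash
    (\d{2})   # two-digit month
    -         # literal dash
    (\d{2})   # two-digit day
""", re.VERBOSE)

match = DATE_RE.fullmatch("2007-01-15")
print(match.groups())  # ('2007', '01', '15')
```

One thing to watch for: because whitespace and # are special in this mode, a literal space or hash has to be escaped or put in a character class.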

OK, on to the answers:

1. [A-PR-Y0-9]{3}-[A-PR-Y0-9]{3}-[A-PR-Y0-9]{4}

This is a US phone number, including ones that use letters (e.g. 831-555-CODE). Rewritten in verbose mode, it makes a lot more sense:

  [A-PR-Y0-9]{3}  # 3-character area code
  -               # Dash
  [A-PR-Y0-9]{3}  # 3-character exchange
  -               # Dash
  [A-PR-Y0-9]{4}  # 4-character suffix

birman had a nice roundup of the problems with this pattern:

[It] doesn’t account for a preceding 1, if the area code is in parenthesis, if the digit groups are separated by a dot or space instead of a dash, or the fact that cell phones have Q and Z on them. It also doesn’t make sure the group is isolated, and not part of something like 1234888-234-123456123.

That last point — the isolation error — is a very common error when writing regular expressions.
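To make the isolation problem concrete, here’s a rough sketch in Python. The lookaround-based fix is just one way to do it, and it assumes that “isolated” means “not touching another letter or digit”; the names PHONE, loose, and isolated are mine.

```python
import re

PHONE = r"[A-PR-Y0-9]{3}-[A-PR-Y0-9]{3}-[A-PR-Y0-9]{4}"

# The pattern as given: happily matches in the middle of a longer run.
loose = re.compile(PHONE)

# Assumption: a phone number should not be glued to other letters/digits,
# so refuse to match if one sits directly before or after.
isolated = re.compile(r"(?<![A-Za-z0-9])" + PHONE + r"(?![A-Za-z0-9])")

junk = "1234888-234-123456123"
print(loose.search(junk).group())   # 888-234-1234
print(isolated.search(junk))        # None
print(isolated.search("Call 831-555-CODE today").group())  # 831-555-CODE
```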

2. &(?!(\w+|#\d+);)

This is not, as most people thought, a mistaken attempt to match HTML entities. It’s actually a pattern that will match ampersands in HTML that are not part of entities (it’s taken from Django’s fix_ampersands template filter).

Here’s the verbose mode:

  &         # Match an ampersand...
  (?!       # ... that is *not* followed by...
    (
      \w+   # ... word characters...
      |     # ... or...
      \#\d+ # ... numeric entity symbols ("#" escaped for verbose mode)...
    )
    ;       # ... and a semi-colon.
  )

The “problem” with this pattern is pretty subtle: it matches HTML entities that are well-formed but still invalid (e.g. &#ggxy;). So as a way of finding unencoded ampersands it’s just fine, but if you wanted to use it as part of an HTML validator, it would be unacceptable.
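Here’s a quick sketch of how a pattern like this might be put to work in Python. The escape_bare_ampersands function is a hypothetical stand-in I wrote for this post, not Django’s actual code; it simply substitutes &amp; wherever the pattern matches.

```python
import re

UNENCODED_AMP = re.compile(r"&(?!(\w+|#\d+);)")

def escape_bare_ampersands(text):
    """Hypothetical helper: encode ampersands that don't start an entity."""
    return UNENCODED_AMP.sub("&amp;", text)

print(escape_bare_ampersands("AT&T &amp; Ben & Jerry&#39;s"))
# AT&amp;T &amp; Ben &amp; Jerry&#39;s

# Anything entity-shaped is left alone, whether or not it is a real entity.
print(escape_bare_ampersands("&ggxy;"))
# &ggxy;
```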

3. (-?(?:0|[1-9]\d*))(\.\d+)?([eE][-+]?\d+)?

Most readers got this one; it’s an IEEE floating point number, with an optional exponent. In verbose mode:

  (               # The non-fractional part of the base
    -?            # could be a leading negative sign
    (?:           # Non-capturing group...
      0|[1-9]\d*  # 0, or multiple digits
    )
  )
  (\.\d+)?        # Decimal point and fractional part of the base
  (               # Exponent
    [eE]          # \
    [-+]?         #  > "e", plus or minus, exponent.
    \d+           # /
  )?

Some readers thought the \d in the base part was a bug. It’s not, actually: that expression matches either 0, or a number that starts with 1-9 and is then followed by any number of digits.

The actual bug is that this pattern matches non-normalized numbers (e.g. 123.45e3, which should more properly be written 1.2345e5).
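If you want to poke at this one yourself, here’s a small Python sketch that shows what each capture group picks up. The NUMBER name and the sample inputs are my own choices.

```python
import re

NUMBER = re.compile(r"(-?(?:0|[1-9]\d*))(\.\d+)?([eE][-+]?\d+)?")

# Groups are (integer part, fractional part, exponent); missing ones are None.
for text in ["0", "-12", "3.14", "123.45e3", "1.2345e5"]:
    print(text, NUMBER.fullmatch(text).groups())

# Leading zeros are rejected: "0" must stand alone in the integer part.
print(NUMBER.fullmatch("007"))  # None

# Both spellings are accepted and denote the same value.
print(float("123.45e3") == float("1.2345e5"))  # True
```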

4. ([\da-f]{2}:){5}([\da-f]{2})

Nearly everyone got this one: it’s a MAC address:

  ([\da-f]{2}:){5}  # Two hex digits followed by a colon, x5
  ([\da-f]{2})      # Two hex digits to end.

As birman noted, this pattern fails to match a few other forms allowed for MAC addresses; they can be written with hyphens (12-34-56-78-9A-BC), or as dotted quads (1234.5678.9ABC).
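For the curious, here’s one way those other forms might be handled in Python. MAC_ANY is a hypothetical pattern of my own, with one alternative per notation; I’ve also assumed you’d want case-insensitive matching, since the original pattern only accepts lowercase hex digits.

```python
import re

# The quiz pattern: colon-separated, lowercase only.
MAC = re.compile(r"([\da-f]{2}:){5}([\da-f]{2})")

# Hypothetical extension: one alternative per notation, any letter case.
MAC_ANY = re.compile(r"""
    (?:[\da-f]{2}:){5}[\da-f]{2}     # colon-separated pairs
  | (?:[\da-f]{2}-){5}[\da-f]{2}     # hyphen-separated pairs
  | (?:[\da-f]{4}\.){2}[\da-f]{4}    # three dot-separated groups of four
""", re.VERBOSE | re.IGNORECASE)

print(MAC.fullmatch("12-34-56-78-9A-BC"))  # None
for text in ["12:34:56:78:9a:bc", "12-34-56-78-9A-BC", "1234.5678.9ABC"]:
    print(text, bool(MAC_ANY.fullmatch(text)))  # True for all three
```

Spelling out each notation separately, rather than using something like [:-] for the separator, keeps mixed forms such as 12:34-56:78:9a:bc from slipping through.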

5. <[^>]*?>

This one also seemed to be easy for most readers; it matches any SGML tag. In verbose syntax:

  <        # Start the tag
  [^>]*?   # Any non-gt characters, as few as possible
  >        # End the tag

The “bug” in this one is a little more abstract: malformed SGML/HTML will severely muck it up. I’ll leave finding such code as an exercise for the reader, though.
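If you’d rather not hunt, here’s one possible answer sketched in Python (skip it if you want to work it out on your own). The sample markup is something I made up; the trick is simply a > inside an attribute value, which ends the match early.

```python
import re

TAG = re.compile(r"<[^>]*?>")

# A ">" inside the attribute value cuts the first "tag" short.
html = '<a title="1 > 0">link</a>'
print(TAG.findall(html))  # ['<a title="1 >', '</a>']
```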

Next time

Tune in tomorrow for the next installment of the quiz. This week’s question will be a “things that every web developer should know” quiz; I think it’s a lot of fun.

See you tomorrow!