Answers to Episode 2 (Real-life regular expressions)

Yeah, I’m a little late getting these answers posted. Sorry!

If you missed it, last week’s challenge dealt with deciphering regular expressions and finding subtle bugs within ‘em.

As with last week, before getting to the actual answers please indulge me while I pontificate a bit:

Hopefully it’s pretty obvious that regular expressions are a double-edged sword. Sure, deciphering them makes a fun quiz, but imagine running across these monsters in code and trying to figure out what they do… not fun.

Fortunately, nearly every regex implementation has a “verbose” mode that allows you to embed comments inside regular expressions (in most languages this is the x flag). For the sake of those who must read your code, please use the verbose mode!
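To make that concrete, here is a minimal sketch in Python, where the flag is spelled re.VERBOSE (or re.X). The date pattern and the names DATE_RE, TERSE_DATE_RE and parse_date are made up for this illustration; they are not part of the quiz.

```python
import re

# Hypothetical example pattern: a YYYY-MM-DD date, written in verbose mode.
# Under re.VERBOSE, whitespace and "#" comments inside the pattern are ignored.
DATE_RE = re.compile(r"""
    (\d{4})   # four-digit year
    -
    (\d{2})   # two-digit month
    -
    (\d{2})   # two-digit day
""", re.VERBOSE)

# The terse equivalent, for comparison.
TERSE_DATE_RE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")


def parse_date(text):
    """Return (year, month, day) as strings, or None if text is not a date."""
    match = DATE_RE.fullmatch(text)
    return match.groups() if match else None
```

Both patterns accept exactly the same strings; the verbose one just explains itself.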

OK, on to the answers:

1. [A-PR-Y0-9]{3}-[A-PR-Y0-9]{3}-[A-PR-Y0-9]{4}

This is a US phone number, including ones that use letters (e.g. 831-555-CODE). Rewritten in verbose mode, it makes a lot more sense:

  [A-PR-Y0-9]{3}  # Area code prefix
  -
  [A-PR-Y0-9]{3}  # 3-digit exchange
  -
  [A-PR-Y0-9]{4}  # 4-digit suffix

birman had a nice roundup of the problems with this pattern:

[It] doesn’t account for a preceding 1, if the area code is in parenthesis, if the digit groups are separated by a dot or space instead of a dash, or the fact that cell phones have Q and Z on them. It also doesn’t make sure the group is isolated, and not part of something like 1234888-234-123456123.

That last point — the isolation error — is a very common error when writing regular expressions.
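Here is a rough sketch of the isolation problem in Python, along with one possible fix. The lookarounds are my assumption about what “isolated” should mean (not touching another letter or digit); the names LOOSE_RE and ISOLATED_RE are invented for the example.

```python
import re

PHONE = r"[A-PR-Y0-9]{3}-[A-PR-Y0-9]{3}-[A-PR-Y0-9]{4}"

# The pattern from the quiz, with nothing around it.
LOOSE_RE = re.compile(PHONE)

# Assumption: a phone number must not be glued to other letters or digits.
# Lookarounds check the neighbours without consuming them.
ISOLATED_RE = re.compile(r"(?<![A-Za-z0-9])" + PHONE + r"(?![A-Za-z0-9])")

embedded = "1234888-234-123456123"

loose_hit = LOOSE_RE.search(embedded)        # finds a "number" inside the junk
isolated_hit = ISOLATED_RE.search(embedded)  # finds nothing
real_hit = ISOLATED_RE.search("Call 831-555-CODE today")
```

The loose pattern happily pulls 888-234-1234 out of the middle of the longer string, while the isolated version rejects it and still accepts a number that stands on its own.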

2. &(?!(\w+|#\d+);)

This is not, as most people thought, a mistaken attempt to match HTML entities. It’s actually a pattern that will match ampersands in HTML that are not part of entities (it’s taken from Django’s fix_ampersands template filter).

Here’s the verbose mode:

  &         # Match an ampersand...
  (?!       # ... that is *not* followed by...
    (
      \w+   # ... word characters...
      |     # ... or...
      \#\d+ # ... a numeric entity ("#" is escaped, since a bare "#" starts a comment)...
    )
    ;       # ... and a semi-colon.
  )

The “problem” with this pattern is pretty subtle: it treats anything shaped like an entity as an entity, even when it’s well-formed but still invalid (e.g. &bogus;), and leaves that ampersand alone. So as a way of finding unencoded ampersands it’s just fine, but if you wanted to use it as part of an HTML validator, it would be unacceptable.
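A quick way to see how the pattern behaves is to run it over a few inputs. The sketch below uses Python's re module; fix_ampersands here is a hypothetical helper written for this example, not Django's actual filter, and &bogus; is a made-up entity name.

```python
import re

# The pattern from question 2 in verbose mode. The "#" has to be escaped,
# because an unescaped "#" starts a comment under re.VERBOSE.
UNENCODED_AMP_RE = re.compile(r"""
    &                # an ampersand...
    (?!              # ...that is not followed by...
      (\w+|\#\d+)    # ...an entity name or a numeric reference...
      ;              # ...and a semicolon
    )
""", re.VERBOSE)


def fix_ampersands(html):
    """Replace ampersands that do not start something entity-shaped."""
    return UNENCODED_AMP_RE.sub("&amp;", html)


examples = {
    "Tom & Jerry": fix_ampersands("Tom & Jerry"),
    "AT&T": fix_ampersands("AT&T"),
    "&copy; 2008": fix_ampersands("&copy; 2008"),
    "&#169;": fix_ampersands("&#169;"),
    "&bogus;": fix_ampersands("&bogus;"),
}
```

Bare ampersands get encoded and real entities are left alone, but so is &bogus;, because the pattern only checks the shape of what follows the ampersand.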

3. (-?(?:0|[1-9]\d*))(\.\d+)?([eE][-+]?\d+)?

Most readers got this one; it’s an IEEE floating point number, with optional exponent. In verbose mode:

  (               # The non-fractional part of the base
    -?            # could be a leading negative sign
    (?:           # Non-capturing group...
      0|[1-9]\d*  # 0, or a nonzero digit followed by any digits
    )
  )
  (\.\d+)?        # Decimal point and fractional part of the base
  (               # Exponent
    [eE]          # \
    [-+]?         #  > "e", plus or minus, exponent.
    \d+           # /
  )?

Some readers thought the \d* in the base part was a bug; it’s not, actually. That expression matches either 0, or a number that starts with 1-9 and then contains any digits.

The actual bug is that this pattern matches non-normalized numbers (e.g. 123.45e3, which should more properly be written 1.2345e5).
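If you want to poke at this one yourself, here is a small Python sketch. The names NUMBER_RE and split_number are invented for the example; the pattern is the one from the quiz, just spread over three lines.

```python
import re

NUMBER_RE = re.compile(r"""
    (-?(?:0|[1-9]\d*))   # integer part: 0, or digits with no leading zero
    (\.\d+)?             # optional fraction
    ([eE][-+]?\d+)?      # optional exponent
""", re.VERBOSE)


def split_number(text):
    """Return (integer, fraction, exponent), or None if text is not a number."""
    match = NUMBER_RE.fullmatch(text)
    return match.groups() if match else None


non_normalized = split_number("123.45e3")
normalized = split_number("1.2345e5")
leading_zero = split_number("0123")
```

Both spellings of the same value are accepted, which is the “bug” described above, while a leading zero such as 0123 is rejected thanks to the 0|[1-9]\d* alternation.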

4. ([\da-f]{2}:){5}([\da-f]{2})

Nearly everyone got this one: it’s a MAC address:

  ([\da-f]{2}:){5}  # Two hex digits followed by a colon, x5
  ([\da-f]{2})      # Two hex digits to end.

As birman noted, this pattern fails to match a few other forms allowed for MAC addresses; they can be written with hyphens (12-34-56-78-9A-BC), or as dotted quads (1234.5678.9ABC).
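As a sketch of how the other forms might be handled, here is one possible extension in Python. ANY_MAC_RE and is_mac are hypothetical names, and the case-insensitive flag is my own addition (the quiz pattern only accepts lowercase hex digits).

```python
import re

# The pattern from the quiz: six colon-separated pairs of hex digits.
# Note that {5} repeats the sub-pattern, so each pair can be different.
COLON_MAC_RE = re.compile(r"([\da-f]{2}:){5}([\da-f]{2})")

# Hypothetical extension: one alternative per notation, so separators
# cannot be mixed within a single address.
ANY_MAC_RE = re.compile(r"""
      (?:[\da-f]{2}:){5}[\da-f]{2}      # 12:34:56:78:9a:bc
    | (?:[\da-f]{2}-){5}[\da-f]{2}      # 12-34-56-78-9A-BC
    | (?:[\da-f]{4}\.){2}[\da-f]{4}     # 1234.5678.9ABC
""", re.VERBOSE | re.IGNORECASE)


def is_mac(text):
    """True if text is a MAC address in any of the three notations."""
    return ANY_MAC_RE.fullmatch(text) is not None


colon_match = COLON_MAC_RE.fullmatch("12:34:56:78:9a:bc")
```

One thing worth knowing about the original: a repeated capturing group only remembers its last repetition, so group 1 of colon_match holds just the fifth pair, not all five.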

5. <[^>]*?>

This one also seemed to be easy for most readers; it matches any SGML tag. In verbose syntax:

  <        # Start the tag
  [^>]*?   # Any non-gt character
  >        # End the tag

The “bug” in this one is a little more abstract: malformed SGML/HTML will severely muck it up. I’ll leave finding such code as an exercise for the reader, though.
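For the impatient, here is one possible answer sketched in Python. The sample strings are made up for this example; any markup with a stray < or > outside of a tag boundary will do.

```python
import re

TAG_RE = re.compile(r"<[^>]*?>")

# A ">" inside a quoted attribute value ends the match too early.
html = '<a title="a > b" href="/x">link</a>'
tags = TAG_RE.findall(html)

# Plain text containing "<" and ">" is mistaken for a tag.
text_tags = TAG_RE.findall("1 < 2 and 3 > 2")

# An empty pair of angle brackets counts as a tag as well.
empty_tags = TAG_RE.findall("a <> b")
```

The first case cuts the opening tag off at the > inside the title attribute, and the other two report “tags” where there are none.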

Next time

Tune in tomorrow for the next installment of the quiz. This week’s question will be a “things that every web developer should know” quiz; I think it’s a lot of fun.

See you tomorrow!

  • malikyte

    …hopefully I can actually catch the next question before it’s practically already all answered. :) Thanks again, Jacob! This is great stuff!

  • http://doitslower.com/ lartexpert

    Why do I miss things like this until they’ve already been and gone? Oh well…

    Couple of things though:
    1) US Phone Numbers
    IANAA (I am not an American…) If isolating the expression so that it doesn’t feature as part of a longer sequence, why not just put \b at the start and end of the regex so that it has to have a word boundary each end? Assuming that neither area codes nor exchange codes can start with a zero, you could restrict it a little more with
    [A-PR-Y1-9][A-PR-Y0-9]{2}-[A-PR-Y1-9][A-PR-Y0-9]{2}-[A-PR-Y0-9]{4}

    2) Non-entity ampersands
    You could probably tighten it up a little, by using [a-z] instead of \w – I don’t know for certain, but I *think* named entities are always a-z chars, not underscores, digits, etc. Also, it might be possible to restrict the number of digits in the number entity version. Overall, though, it’s a classic example of why parsing HTML with regexes is a Bad Thing(tm)

    3) Floating point numbers
    You could avoid the problem of passing things like 123.4567e4 by just removing the first \d* from the expression – maybe that’s what people meant when they said it’s a bug? Also, what was the reason for capturing the part that comes before the decimal point? Leaving non-capturing groups aside, what’s the benefit of putting this part in brackets?

    4) MAC addresses
    If you disregard the other forms for a MAC address, one severe failing of this regex is that the {5} quantifier means that the regex will only match MAC addresses that have the same octet five times, e.g. AB:AB:AB:AB:AB:CD – maybe it would have been better to do something like
    ([\da-f]{2}:[\da-f]{2}:[\da-f]{2}:[\da-f]{2}:[\da-f]{2}:[\da-f]{2})
    or
    (([\da-f]{2}):([\da-f]{2}):([\da-f]{2}):([\da-f]{2}):([\da-f]{2}))
    if you want to capture each octet separately. There’s also the POSIX character class [:xdigit:] for hex digits, but some people find [\da-f] easier to read!

    5) SGML elements
    ]*?> will also match an empty element, like , since * will also match zero occurences – maybe better to have ]+> … You don’t need the non-greedy ? as it won’t change what the regex matches. You could look at restricting things further, maybe elements using a-z, etc, although again there’s the trying-to-validate-sgml-with-regex problem again!

  • http://doitslower.com/ lartexpert

    Whoops! Thanks to the joy of CMS markup, that didn’t quite come out right…

    For the SGML bit, rather than using <[^>]*?> it would be better to have <[^>]+> to avoid matching empty elements like <>

    Crossing fingers that this time it will get through the markup engine ;-)

  • cnun

    thanks for the information :)