Episode 2: Real-world regular expressions

Let’s get this out there right off the bat: I love regular expressions. Really, I do. They’re the Swiss Army knife of text processing, and no self-respecting developer can go long without needing ’em.

Of course, we all also know how dangerous they can be. As always, with great power comes great responsibility.

Still, if you know how — and when — and why — to use regular expressions, they’re indispensable. So this week, regular expressions will be our theme.

Below are five regular expressions. Each one of them matches a real-world string; that is, a semi-structured piece of text you might want to pull out of a larger document. Here’s an example question to give you an idea of what I mean:

  1. [0-9]{5}

This, of course, is a US ZIP code.

So, what “things” do these regular expressions match? We’ll assume for this quiz that the regex engine is running in case-insensitive mode:

  1. [A-PR-Y0-9]{3}-[A-PR-Y0-9]{3}-[A-PR-Y0-9]{4}
  2. &(?!(\w+|#\d+);)
  3. (-?(?:0|[1-9]\d*))(\.\d+)?([eE][-+]?\d+)?
  4. ([\da-f]{2}:){5}([\da-f]{2})
  5. <[^>]*?>

Of course, since we’re dealing with regular expressions here, I’d be remiss if I didn’t give you two problems for the price of one.

In each case, the regular expression has something wrong with it. For example, the ZIP code regex above doesn’t correctly match the ZIP+4 format (e.g. 66044-0034) that’s used for many addresses these days.
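To make the ZIP+4 gap concrete, here is a minimal Python sketch. The `ZIP_PLUS4` pattern, the `full` helper, and the sample strings are illustrative assumptions, not part of the quiz or an official format definition.

```python
import re

# The example pattern from above: exactly five digits.
ZIP = re.compile(r"[0-9]{5}")

# One possible extension (an assumption for illustration):
# five digits, optionally followed by a hyphen and four more digits.
ZIP_PLUS4 = re.compile(r"[0-9]{5}(?:-[0-9]{4})?")


def full(pattern, text):
    """Return True only if the whole string matches the pattern."""
    return pattern.fullmatch(text) is not None


print(full(ZIP, "66044"))             # True
print(full(ZIP, "66044-0034"))        # False: the +4 part is not covered
print(full(ZIP_PLUS4, "66044-0034"))  # True
print(full(ZIP_PLUS4, "66044"))       # True: the suffix is optional
```

Using `fullmatch` rather than `search` matters here: an unanchored search with the five-digit pattern would happily find `66044` inside the longer string and hide the problem.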

So, for part two, what’s wrong with the rest of ‘em?

Enjoy your Thanksgiving belly-stuffing, and tune in over the weekend for the answers.

  • mmanders

    1. I’m not from the States, so I’m unsure, but is it a Social Security number? If so, then the second group should only have a repetition of 2.
    [A-PR-Y0-9]{3}-[A-PR-Y0-9]{2}-[A-PR-Y0-9]{4}

    2. Not a clue about this one! An optional ampersand, followed by an exclamation mark, followed by either one or more words (alphanumerics) or a hash followed by one or more digits, terminated with a semi-colon.

    3. I think this is a number expressed in scientific notation, e.g. 1.3e10. However, I don’t think the \d is necessary in “[1-9]\d*”

    4. This looks like a MAC address. Six hex numbers separated by colons. However, I can’t see anything wrong with it so I’m probably wrong!

    5. This looks like it would match an SGML tag of some sort, although it’s not very specific. It will match anything starting with a ‘<’ followed by a closing ‘>’.

  • mmanders

    Edit… 5. should read “… anything starting with a ‘<’ followed by a closing ‘>’.”

  • dix

    Number 2 appears to be matching HTML character codes (e.g. &nbsp; or &copy;). The ! makes it a negative lookahead assertion, which should be removed for it to work correctly.

  • Larry

    1. Telephone numbers, including those represented as letters.
    2. HTML character entity encodings in hexadecimal form.
    3. Numbers in scientific/engineering notation.
    4. IPv6 IP addresses
    5. HTML/XML/XHTML Tags. Any markup, essentially, where the tags use open/closing angle brackets.

  • dev_cw

    This is a fun way to learn a bit more about regex.

  • birnam

    I agree with Larry, except I think #4 is a MAC address, not an IP address (like mmanders suggested)
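A rough way to check the MAC-versus-IPv6 question is to compare what quiz pattern 4 accepts with what Python's standard `ipaddress` module accepts. This is only a sketch: the sample values and the `is_ipv6` helper are made up for illustration, and the pattern is written with its `\d` escapes spelled out.

```python
import ipaddress
import re

# Quiz pattern 4 in case-insensitive mode, as the quiz assumes.
PATTERN_4 = re.compile(r"([\da-f]{2}:){5}([\da-f]{2})", re.IGNORECASE)


def is_ipv6(text):
    """Return True if the standard library accepts text as an IPv6 address."""
    try:
        ipaddress.IPv6Address(text)
        return True
    except ValueError:
        return False


mac_like = "00:1A:2b:3c:4d:5e"   # six groups of two hex digits (made-up value)
ipv6 = "2001:db8:0:0:0:0:0:1"    # eight groups, documentation address range

print(PATTERN_4.fullmatch(mac_like) is not None)  # True
print(is_ipv6(mac_like))                          # False
print(PATTERN_4.fullmatch(ipv6) is not None)      # False
print(is_ipv6(ipv6))                              # True
```

Six colon-separated pairs of hex digits fit the pattern but not IPv6, which is consistent with the MAC address reading.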

    As for the “errors”:

    1. US phone number: doesn’t account for a preceding 1, for the area code being in parentheses, for digit groups separated by a dot or space instead of a dash, or for the fact that cell phones have Q and Z on them. It also doesn’t make sure the group is isolated, and not part of something like 1234888-234-123456123.

    \b((1[-. ])?\(?[a-z0-9]{3}\)?[-. ][a-z0-9]{3}[-.][a-z0-9]{4})\b

    2. HTML character and entity references: this should be a positive look-ahead, not a negative one, and instead of a word character it should be a-z, since entity references don’t have ‘_’

    &(?=([a-z]+|#\d+);)

    3. numbers in exponential notation (with ‘1.2354 e10’ style exponent): I believe that exponential notation only has one digit before the decimal point, so the \d* should be dropped and a negative lookbehind added to ensure a single digit. Also, there could be a space between the digits and the exponent.

    \b(?<!\d)(-?[0-9])(\.\d+)?\s?([eE][-+ ]?\d+)\b

    4. MAC address: doesn’t allow for digit groups to be delimited by hyphens. Because : counts as a non-word character, it’s not as easy as putting a word boundary on either side.

    (?<![-0-9a-f:])([\da-f]{2}[-:]){5}([\da-f]{2})(?![-0-9a-f:])

    There’s also a type of MAC address format like 0123.4567.89ab, so you could more precisely do this:

    (?<![-0-9a-f:])(([\da-f]{2}[-:]){5}([\da-f]{2})|([\da-f]{4}\.){2}([\da-f]{4}))(?!\.?[-0-9a-f:])

    5. XML-style markup tags: the contents shouldn’t use an asterisk, since that could also match an empty <>. And the non-greedy * followed by a > and the match for any character but > were accomplishing the same thing, so I dropped one.

    (<.+?>)
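The two tag patterns are easy to compare side by side in Python. This is only a sketch on made-up sample strings; `ORIGINAL` is quiz pattern 5 and `SUGGESTED` is the alternative above without its outer capture group.

```python
import re

ORIGINAL = re.compile(r"<[^>]*?>")  # quiz pattern 5
SUGGESTED = re.compile(r"<.+?>")    # the suggested replacement

sample = "a <> b <i>"
print(ORIGINAL.findall(sample))   # ['<>', '<i>']
print(SUGGESTED.findall(sample))  # ['<> b <i>']

# A '>' inside an attribute value cuts the match short for both patterns.
attr = '<a title="x > y">'
print(ORIGINAL.findall(attr))     # ['<a title="x >']
print(SUGGESTED.findall(attr))    # ['<a title="x >']
```

On this sample, `.+?` does skip the empty `<>` as a standalone match, but because `.` can also consume a `>`, it runs on to the next closing bracket instead. Both patterns stop early when an attribute value contains `>`.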

    Note: I’m assuming these will be processed as case-insensitive, or else there’s a whole new set of problems… There are a million different ways to do regex, so these are just my suggestions — I’m sure there are better ways.
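The case-insensitivity assumption in the note above is easy to demonstrate with the hyphen-aware MAC pattern. A small Python sketch, with a made-up sample line and the `\d` escapes written out:

```python
import re

# Hyphen-or-colon MAC pattern with guards on both sides.
MAC = r"(?<![-0-9a-f:])([\da-f]{2}[-:]){5}([\da-f]{2})(?![-0-9a-f:])"

sample = "eth0 00-1A-2B-3C-4D-5E up"  # made-up log line with uppercase hex

# Case-sensitive: the character classes only list a-f, so 'A' breaks the match.
print(re.search(MAC, sample) is None)                  # True

# Case-insensitive: the same pattern now finds the address.
print(re.search(MAC, sample, re.IGNORECASE).group(0))  # 00-1A-2B-3C-4D-5E
```

Without the flag, each class would need to spell out `A-F` as well.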

    This was fun!

  • larryp

    Hi birnam,

    I agree with your disagreement with me on item 4. :) I jumped the gun on the number of groups. Good job on the explanations/corrections, too.

  • http://chris.unigliding.co.uk Stormrider

    bah. Another US specific one :(

  • http://www.sitepoint.com/ mmj

    Number 2 searches for any occurrence of an ampersand (&) that does NOT appear to be the beginning of a named or numeric entity.

    It may be useful if you need to find what appear to be unescaped ampersands in a string.

    “This & that” would match
    “This &amp; that” would not match

    It doesn’t take into account the validity of the entity reference, and doesn’t account for numeric character entities in hexadecimal form.
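The behavior described above can be checked with a short Python sketch. `BARE_AMP` is quiz pattern 2 with its escapes written out; `BARE_AMP_HEX` is a hypothetical extension for hexadecimal references, added here only for illustration.

```python
import re

# Quiz pattern 2: an ampersand NOT followed by "name;" or "#digits;".
BARE_AMP = re.compile(r"&(?!(\w+|#\d+);)")

# Hypothetical variant that also recognizes hex references such as &#x26;
BARE_AMP_HEX = re.compile(r"&(?!(\w+|#\d+|#x[0-9a-f]+);)", re.IGNORECASE)

print(bool(BARE_AMP.search("This & that")))           # True: bare ampersand
print(bool(BARE_AMP.search("This &amp; that")))       # False: named entity
print(bool(BARE_AMP.search("This &#38; that")))       # False: decimal reference
print(bool(BARE_AMP.search("This &bogus; that")))     # False: validity is not checked
print(bool(BARE_AMP.search("This &#x26; that")))      # True: hex form is flagged
print(bool(BARE_AMP_HEX.search("This &#x26; that")))  # False with the hex variant
```

Neither version checks that a named entity actually exists; that would need a lookup against a list of known entity names rather than a pattern.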

  • Anonymous

    Nice collection!
    Here is a good example which shows you how to implement regular expressions with .NET for U.S. Social Security numbers.

    ^((?!000)([0-6]\d{2}|[0-7]{2}[0-2]))-((?!00)\d{2})-((?!0000)\d{4})$

    Also get complete code for ASP.NET, VB.NET and C#.NET

    Please check:- http://www.tipsntracks.com/98/regular-expressions-with-net-us-social-security-numbers.html