Comments on: Episode 2: Real-world regular expressions
http://www.sitepoint.com/blogs/2006/11/22/episode-2-real-world-regular-expressions/
Mon, 13 Oct 2008 11:28:37 +0000

By: mmj
http://www.sitepoint.com/blogs/2006/11/22/episode-2-real-world-regular-expressions/#comment-103229
Wed, 22 Nov 2006 23:11:11 +0000

Number 2 searches for any occurrence of an ampersand (&) that does NOT appear to be the beginning of a named or numeric entity.

It may be useful if you need to find what appear to be unescaped ampersands in a string.

“This & that” would match
“This &amp; that” would not match

It doesn’t take into account the validity of the entity reference, and doesn’t account for numeric character entities in hexadecimal form.
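
A minimal Python sketch of the behaviour described above. The pattern is an assumption, reconstructed from the descriptions in this thread (the original expression is not quoted here), and the names UNESCAPED_AMP and find_unescaped are made up for the example.

```python
import re

# Assumed reconstruction of pattern number 2: an ampersand that is NOT
# followed by a named (&name;) or decimal (&#123;) entity body and a semicolon.
UNESCAPED_AMP = re.compile(r"&(?!(?:\w+|#\d+);)")

def find_unescaped(text):
    """Return the offsets of ampersands that do not look like entity starts."""
    return [m.start() for m in UNESCAPED_AMP.finditer(text)]

print(find_unescaped("This & that"))      # the bare ampersand is reported
print(find_unescaped("This &amp; that"))  # nothing is reported
print(find_unescaped("&bogus;"))          # nothing reported: validity is not checked
print(find_unescaped("&#x26;"))           # reported: the hex form is not recognised
```

The last two calls illustrate the two limitations mentioned: an unknown entity name still counts as an entity, and a hexadecimal reference is treated as an unescaped ampersand.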

By: Stormrider
http://www.sitepoint.com/blogs/2006/11/22/episode-2-real-world-regular-expressions/#comment-102707
Wed, 22 Nov 2006 12:47:30 +0000

Bah. Another US-specific one :(

By: larryp
http://www.sitepoint.com/blogs/2006/11/22/episode-2-real-world-regular-expressions/#comment-102164
Wed, 22 Nov 2006 01:54:15 +0000

Hi birnam,

I agree with your disagreement with me on item 4. :) I jumped the gun on the number of groups. Good job on the explanations/corrections, too.

By: birnam
http://www.sitepoint.com/blogs/2006/11/22/episode-2-real-world-regular-expressions/#comment-101984
Tue, 21 Nov 2006 22:44:58 +0000

I agree with Larry, except I think #4 is a MAC address (as mmanders suggested), not an IP address.

As for the “errors”:

1. US phone number -- doesn’t account for a preceding 1, an area code in parentheses, digit groups separated by a dot or space instead of a dash, or the fact that cell phones have Q and Z on them. It also doesn’t make sure the group is isolated, and not part of something like 1234888-234-123456123.

 \b((1[-. ])?\(?[a-z0-9]{3}\)?[-. ][a-z0-9]{3}[-.][a-z0-9]{4})\b
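
A quick way to try the pattern above, as a Python sketch. The pattern is copied as posted and compiled case-insensitively (per the note at the end of this comment); PHONE, phone_match and the sample strings are made up for the example.

```python
import re

# The pattern as posted, compiled case-insensitively.
PHONE = re.compile(
    r"\b((1[-. ])?\(?[a-z0-9]{3}\)?[-. ][a-z0-9]{3}[-.][a-z0-9]{4})\b",
    re.IGNORECASE,
)

def phone_match(text):
    """Return the matched substring, or None when nothing matches."""
    m = PHONE.search(text)
    return m.group(0) if m else None

samples = [
    "888-234-1234",
    "1-888-234-1234",
    "888.234.1234",
    "888 234 1234",           # the second separator class has no space in it
    "1234888-234-123456123",  # digits embedded in a longer run
]

for s in samples:
    print(f"{s!r:26} -> {phone_match(s)!r}")
```

As written, the space-separated sample does not match, because only the first separator class includes a space; adding one to the second class would probably be the intent.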

2. HTML character and entity references — should be a positive look-ahead, not a negative one, and instead of a word character it should be a-z since entity references don’t have ‘_’

&(?=([a-z]+|\#\d+);)
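
For comparison, a small Python sketch of what the positive look-ahead version above selects. ENTITY_AMP and entity_starts are names made up for the example, and case-insensitive matching is assumed.

```python
import re

# The positive look-ahead version: an ampersand that DOES start a named or
# decimal numeric entity. Compiled case-insensitively.
ENTITY_AMP = re.compile(r"&(?=([a-z]+|\#\d+);)", re.IGNORECASE)

def entity_starts(text):
    """Return the offsets of ampersands that begin an entity reference."""
    return [m.start() for m in ENTITY_AMP.finditer(text)]

print(entity_starts("This &amp; that"))  # the entity's ampersand is found
print(entity_starts("This & that"))      # a bare ampersand is ignored
print(entity_starts("&#169; 2006"))      # decimal numeric reference is found
print(entity_starts("&foo_bar;"))        # underscore is rejected by [a-z]+
```

Only the ampersand itself is consumed; the look-ahead checks the entity body without including it in the match.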

3. numbers in exponential notation (with ‘1.2354 e10’ style exponent) -- I believe that exponential notation only has one digit before the decimal, so the \d* should be dropped and a negative lookbehind added to ensure a single digit. Also, there could be a space between the digits and the exponent.

\b(?<!\d)(-?[0-9])(\.\d+)?\s?([eE][-+ ]?\d+)\b
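
A short Python sketch for trying the pattern above on a few inputs. SCI, sci_match and the sample strings are made up for the example; the pattern itself is copied as posted.

```python
import re

# The pattern as posted.
SCI = re.compile(r"\b(?<!\d)(-?[0-9])(\.\d+)?\s?([eE][-+ ]?\d+)\b")

def sci_match(text):
    """Return the matched substring, or None when nothing matches."""
    m = SCI.search(text)
    return m.group(0) if m else None

print(sci_match("1.2354e10"))   # matches the whole string
print(sci_match("1.2354 e10"))  # the optional space is accepted
print(sci_match("6.02E+23"))    # signed exponent is accepted
print(sci_match("12e3"))        # two leading digits: no match
print(sci_match("-1.5e-3"))     # matches, but without the leading minus
```

The last case is worth noting: the leading \b cannot sit in front of a minus sign at the start of the string, so the sign is left out of the match there.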

4. MAC address — doesn’t allow for digit groups to be delineated by hyphens. Because : counts as a non-word character it’s not as easy as putting a word boundary on either side.

(?<![-0-9a-f:])([\da-f]{2}[-:]){5}([\da-f]{2})(?![-0-9a-f:])

There’s also a type of MAC address format like 0123.4567.89ab, so you could more precisely do this:

(?<![-0-9a-f:])(([\da-f]{2}[-:]){5}([\da-f]{2})|([\da-f]{4}\.){2}([\da-f]{4}))(?!\.?[-0-9a-f:])
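
Here is a rough Python sketch exercising the combined pattern above. MAC, mac_match and the sample addresses are made up for the example, and case-insensitive matching is assumed.

```python
import re

# The combined pattern as posted, split across lines for readability and
# compiled case-insensitively.
MAC = re.compile(
    r"(?<![-0-9a-f:])"
    r"(([\da-f]{2}[-:]){5}([\da-f]{2})|([\da-f]{4}\.){2}([\da-f]{4}))"
    r"(?!\.?[-0-9a-f:])",
    re.IGNORECASE,
)

def mac_match(text):
    """Return the matched address, or None when nothing matches."""
    m = MAC.search(text)
    return m.group(0) if m else None

print(mac_match("00:1A:2B:3C:4D:5E"))          # colon-separated
print(mac_match("00-1a-2b-3c-4d-5e"))          # hyphen-separated
print(mac_match("0123.4567.89ab"))             # dotted groups of four
print(mac_match("eth0 00:1A:2B:3C:4D:5E up"))  # found inside other text
print(mac_match("00:1A:2B:3C:4D:5E:6F"))       # seven groups: rejected
```

The look-behind and look-ahead do the job that \b cannot do here, so a seven-group string is rejected rather than matched on its first six groups.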

5. XML style markup tags -- shouldn’t have an asterisk for the contents, since that could also match an empty <>. Also, the non-greedy * followed by a > and the match for any character but > were accomplishing the same thing, so I dropped one.

(<.+?>)
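
A small Python sketch showing what the non-greedy pattern above picks up. TAG, tags and the sample strings are made up for the example.

```python
import re

# The pattern as posted: "<", at least one character (non-greedy), then ">".
TAG = re.compile(r"(<.+?>)")

def tags(text):
    """Return every substring the pattern treats as a tag."""
    return TAG.findall(text)

print(tags("<p>Hi <em>there</em></p>"))  # each tag is returned separately
print(tags("a <> b"))                    # an empty <> is not matched
print(tags("1 < 2 and 3 > 2"))           # caveat: a comparison looks like a tag
```

The third call is a known weak spot of this kind of pattern: any "<" followed later by a ">" on the same line is treated as markup.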

Note: I’m assuming these will be processed as case-insensitive, or else there’s a whole new set of problems… There are a million different ways to do regex, so these are just my suggestions — I’m sure there are better ways.

This was fun!

By: dev_cw
http://www.sitepoint.com/blogs/2006/11/22/episode-2-real-world-regular-expressions/#comment-101892
Tue, 21 Nov 2006 20:47:50 +0000

This is a fun way to learn a bit more about regex.

By: Larry
http://www.sitepoint.com/blogs/2006/11/22/episode-2-real-world-regular-expressions/#comment-101795
Tue, 21 Nov 2006 19:17:47 +0000

1. Telephone numbers, including those represented as letters.
2. HTML character entity encodings in hexadecimal form.
3. Numbers in scientific/engineering notation.
4. IPv6 addresses.
5. HTML/XML/XHTML Tags. Any markup, essentially, where the tags use open/closing angle brackets.

By: dix
http://www.sitepoint.com/blogs/2006/11/22/episode-2-real-world-regular-expressions/#comment-101789
Tue, 21 Nov 2006 19:13:59 +0000

Number 2 appears to be matching HTML character codes (e.g. &nbsp; or &copy;). The ! is a negative lookahead assertion, which should be removed to have it work correctly.

By: mmanders
http://www.sitepoint.com/blogs/2006/11/22/episode-2-real-world-regular-expressions/#comment-101697
Tue, 21 Nov 2006 17:45:15 +0000

Edit… 5. should read “… anything starting with a ‘\

By: mmanders
http://www.sitepoint.com/blogs/2006/11/22/episode-2-real-world-regular-expressions/#comment-101696
Tue, 21 Nov 2006 17:42:45 +0000

1. I’m not from the States so am unsure, but is it a social security number? If so, then the second sequence should only contain a repetition of 2.
[A-PR-Y0-9]{3}-[A-PR-Y0-9]{2}-[A-PR-Y0-9]{4}
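
To see what that change does, here is a minimal Python sketch. Checking the whole string with fullmatch is a choice made for this example, not something stated in the comment, and SSN_LIKE and looks_like_ssn are made-up names.

```python
import re

# The suggested 3-2-4 grouping, using the same character class
# (letters except Q and Z, plus digits).
SSN_LIKE = re.compile(r"[A-PR-Y0-9]{3}-[A-PR-Y0-9]{2}-[A-PR-Y0-9]{4}")

def looks_like_ssn(text):
    """True when the whole string fits the 3-2-4 grouping."""
    return SSN_LIKE.fullmatch(text) is not None

print(looks_like_ssn("123-45-6789"))   # 3-2-4 grouping fits
print(looks_like_ssn("123-456-7890"))  # 3-3-4 grouping does not fit
print(looks_like_ssn("Q23-45-6789"))   # Q is excluded by the character class
```

The letters in the character class are what make the phone-number reading in the other comments more plausible than a social security number, which is digits only.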

2. Not a clue about this one! An optional ampersand, followed by an exclamation mark, followed by either one or more words (alphanumerics) or a hash followed by one or more digits, terminated with a semi-colon.

3. I think this is a number expressed in scientific notation, e.g. 1.3e10 - However, I don’t think the \d is necessary in “[1-9]\d*”

4. This looks like a MAC address. Six hex numbers separated by colons. However, I can’t see anything wrong with it so I’m probably wrong!

5. This looks like it would match an SGML tag of some sort, although it’s not very specific. It will match anything starting with a ‘<’ followed by a closing ‘>’.
