<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>
<channel>
	<title>Comments on: Answers to Episode 2 (Real-life regular expressions)</title>
	<atom:link href="http://www.sitepoint.com/blogs/2006/11/28/answers-to-episode-2-real-life-regular-expressions/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.sitepoint.com/blogs/2006/11/28/answers-to-episode-2-real-life-regular-expressions/</link>
	<description>News, opinion, and fresh thinking for web developers and designers. The official podcast of sitepoint.com.</description>
	<pubDate>Fri, 21 Nov 2008 08:46:37 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.5</generator>
		<item>
		<title>By: cnun</title>
		<link>http://www.sitepoint.com/blogs/2006/11/28/answers-to-episode-2-real-life-regular-expressions/#comment-138011</link>
		<dc:creator>cnun</dc:creator>
		<pubDate>Tue, 26 Dec 2006 04:44:48 +0000</pubDate>
		<guid isPermaLink="false">http://www.sitepoint.com/blogs/?p=1794#comment-138011</guid>
		<description>thanks for the information :)</description>
		<content:encoded><![CDATA[<p>thanks for the information :)</p>]]></content:encoded>
	</item>
	<item>
		<title>By: lartexpert</title>
		<link>http://www.sitepoint.com/blogs/2006/11/28/answers-to-episode-2-real-life-regular-expressions/#comment-114000</link>
		<dc:creator>lartexpert</dc:creator>
		<pubDate>Sat, 02 Dec 2006 19:01:39 +0000</pubDate>
		<guid isPermaLink="false">http://www.sitepoint.com/blogs/?p=1794#comment-114000</guid>
		<description>Whoops! Thanks to the joy of CMS markup, that didn't quite come out right...

For the SGML bit, rather than using &lt;strong&gt;&#60;[^&#62;]*?&#62;&lt;/strong&gt; it would be better to have &lt;strong&gt;&#60;[^&#62;]+&#62;&lt;/strong&gt; to avoid matching empty elements like &lt;code&gt;&#60;&#62;&lt;/code&gt;

Crossing fingers that this time it will get through the markup engine ;-)</description>
		<content:encoded><![CDATA[<p>Whoops! Thanks to the joy of CMS markup, that didn&#8217;t quite come out right&#8230;</p>
<p>For the SGML bit, rather than using <strong>&lt;[^&gt;]*?&gt;</strong> it would be better to have <strong>&lt;[^&gt;]+&gt;</strong> to avoid matching empty elements like <code>&lt;&gt;</code></p>
<p>Crossing fingers that this time it will get through the markup engine ;-)</p>]]></content:encoded>
	</item>
	<item>
		<title>By: lartexpert</title>
		<link>http://www.sitepoint.com/blogs/2006/11/28/answers-to-episode-2-real-life-regular-expressions/#comment-113998</link>
		<dc:creator>lartexpert</dc:creator>
		<pubDate>Sat, 02 Dec 2006 18:57:08 +0000</pubDate>
		<guid isPermaLink="false">http://www.sitepoint.com/blogs/?p=1794#comment-113998</guid>
		<description>Why do I miss things like this until they've already been and gone? Oh well...

Couple of things though:
1) US Phone Numbers
IANAA (I am not an American...) If the aim is to isolate the expression so that it doesn't match as part of a longer sequence, why not just put \b at the start and end of the regex so that it has to have a word boundary at each end?  Assuming that neither area codes nor exchange codes can start with a zero, you could restrict it a little more with
[A-PR-Y1-9][A-PR-Y0-9]{2}-[A-PR-Y1-9][A-PR-Y0-9]{2}-[A-PR-Y0-9]{4}

2) Non-entity ampersands
You could probably tighten it up a little, by using [a-z] instead of \w - I don't know for certain, but I *think* named entities are always a-z chars, not underscores, digits, etc.  Also, it might be possible to restrict the number of digits in the number entity version.  Overall, though, it's a classic example of why parsing HTML with regexes is a Bad Thing(tm)

3) Floating point numbers
You could avoid the problem of passing things like 123.4567e4 by just removing the first \d* from the expression - maybe that's what people meant when they said it's a bug?  Also, what was the reason for capturing the part that comes before the decimal point?  Leaving non-capturing groups aside, what's the benefit of putting this part in brackets?

4) MAC addresses
If you disregard the other forms for a MAC address, one failing of this regex is that a group repeated with the {5} quantifier only keeps the last octet it matched (the pattern itself is re-applied each time, so the octets don't have to be identical, but you can't get at each one afterwards) - maybe it would have been better to do something like
([\da-f]{2}:[\da-f]{2}:[\da-f]{2}:[\da-f]{2}:[\da-f]{2}:[\da-f]{2})
or
(([\da-f]{2}):([\da-f]{2}):([\da-f]{2}):([\da-f]{2}):([\da-f]{2}):([\da-f]{2}))
if you want to capture each octet separately.  There's also the POSIX character class [:xdigit:] for hex digits, but some people find [\da-f] easier to read!

5) SGML elements
&#60;[^&#62;]*?&#62; will also match an empty element, like &#60;&#62;, since * will also match zero occurrences - maybe better to have &#60;[^&#62;]+&#62; ... You don't need the non-greedy ? as it won't change what the regex matches.  You could look at restricting things further, maybe elements using a-z, etc, although there's the trying-to-validate-sgml-with-regex problem again!</description>
		<content:encoded><![CDATA[<p>Why do I miss things like this until they&#8217;ve already been and gone? Oh well&#8230;</p>
<p>Couple of things though:<br />
1) US Phone Numbers<br />
IANAA (I am not an American&#8230;) If the aim is to isolate the expression so that it doesn&#8217;t match as part of a longer sequence, why not just put \b at the start and end of the regex so that it has to have a word boundary at each end?  Assuming that neither area codes nor exchange codes can start with a zero, you could restrict it a little more with<br />
[A-PR-Y1-9][A-PR-Y0-9]{2}-[A-PR-Y1-9][A-PR-Y0-9]{2}-[A-PR-Y0-9]{4}</p>
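<p>A minimal Python sketch of that suggestion, assuming Python&#8217;s re module as the engine (the thread doesn&#8217;t name one), with find_phones as a made-up helper name. Bear in mind that a hyphen also counts as a word boundary, so \b on its own won&#8217;t reject a number glued onto a longer hyphenated run such as 12-555-867-5309.</p>

```python
import re

# Pattern from the comment, wrapped in \b so that it cannot start or end
# in the middle of a longer run of letters or digits.
PHONE = re.compile(
    r"\b[A-PR-Y1-9][A-PR-Y0-9]{2}-[A-PR-Y1-9][A-PR-Y0-9]{2}-[A-PR-Y0-9]{4}\b"
)


def find_phones(text):
    """Return every substring of text that matches PHONE (hypothetical helper)."""
    return PHONE.findall(text)


# One plain number and one vanity-style number.
print(find_phones("Call 555-867-5309 or 800-FLY-AWAY today"))
# A leading zero and a trailing extra digit should both be rejected.
print(find_phones("Not a match: 055-867-5309 or 555-867-53091"))
```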
<p>2) Non-entity ampersands<br />
You could probably tighten it up a little, by using [a-z] instead of \w - I don&#8217;t know for certain, but I *think* named entities are always a-z chars, not underscores, digits, etc.  Also, it might be possible to restrict the number of digits in the number entity version.  Overall, though, it&#8217;s a classic example of why parsing HTML with regexes is a Bad Thing(tm)</p>
<p>3) Floating point numbers<br />
You could avoid the problem of passing things like 123.4567e4 by just removing the first \d* from the expression - maybe that&#8217;s what people meant when they said it&#8217;s a bug?  Also, what was the reason for capturing the part that comes before the decimal point?  Leaving non-capturing groups aside, what&#8217;s the benefit of putting this part in brackets?</p>
<p>4) MAC addresses<br />
If you disregard the other forms for a MAC address, one failing of this regex is that a group repeated with the {5} quantifier only keeps the last octet it matched (the pattern itself is re-applied each time, so the octets don&#8217;t have to be identical, but you can&#8217;t get at each one afterwards) - maybe it would have been better to do something like<br />
([\da-f]{2}:[\da-f]{2}:[\da-f]{2}:[\da-f]{2}:[\da-f]{2}:[\da-f]{2})<br />
or<br />
(([\da-f]{2}):([\da-f]{2}):([\da-f]{2}):([\da-f]{2}):([\da-f]{2}):([\da-f]{2}))<br />
if you want to capture each octet separately.  There&#8217;s also the POSIX character class [:xdigit:] for hex digits, but some people find [\da-f] easier to read!</p>
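<p>A quick Python check of the capture behaviour, assuming the original pattern had the shape ([\da-f]{2}:){5}[\da-f]{2} (a guess, since it isn&#8217;t quoted in this thread) and that matching is case-insensitive. REPEATED and SIX_GROUPS are illustrative names only. The sketch suggests a repeated group accepts differing octets but remembers only its last repetition, whereas six explicit groups keep every octet.</p>

```python
import re

# Assumed shape of the original pattern: one group repeated five times.
REPEATED = re.compile(r"^([\da-f]{2}:){5}[\da-f]{2}$", re.IGNORECASE)

# One explicit capture group per octet, six in total.
SIX_GROUPS = re.compile(
    r"^([\da-f]{2}):([\da-f]{2}):([\da-f]{2}):"
    r"([\da-f]{2}):([\da-f]{2}):([\da-f]{2})$",
    re.IGNORECASE,
)

mac = "00:1A:2B:3C:4D:5E"

repeated_match = REPEATED.match(mac)
# The octets differ, yet the pattern still matches; group(1) holds only
# the text from the final repetition of the group.
last_repetition = repeated_match.group(1)

# Each of the six groups keeps its own octet.
octets = SIX_GROUPS.match(mac).groups()

print(last_repetition)
print(octets)
```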
<p>5) SGML elements<br />
&lt;[^&gt;]*?&gt; will also match an empty element, like &lt;&gt;, since * will also match zero occurrences - maybe better to have &lt;[^&gt;]+&gt; &#8230; You don&#8217;t need the non-greedy ? as it won&#8217;t change what the regex matches.  You could look at restricting things further, maybe elements using a-z, etc, although there&#8217;s the trying-to-validate-sgml-with-regex problem again!</p>]]></content:encoded>
	</item>
	<item>
		<title>By: malikyte</title>
		<link>http://www.sitepoint.com/blogs/2006/11/28/answers-to-episode-2-real-life-regular-expressions/#comment-109004</link>
		<dc:creator>malikyte</dc:creator>
		<pubDate>Tue, 28 Nov 2006 00:46:08 +0000</pubDate>
		<guid isPermaLink="false">http://www.sitepoint.com/blogs/?p=1794#comment-109004</guid>
		<description>...hopefully I can actually catch the next question before it's practically already all answered.  :)  Thanks again, Jacob!  This is great stuff!</description>
		<content:encoded><![CDATA[<p>&#8230;hopefully I can actually catch the next question before it&#8217;s practically already all answered.  :)  Thanks again, Jacob!  This is great stuff!</p>]]></content:encoded>
	</item>
</channel>
</rss>
