Regular Expressions – Gotta Love Them

Sarah Hawk

This morning I handed over the reins of our regular Talk with the Experts sessions to Fernando, who did a sterling job of running the chat, the subject of which was Regular Expressions. Our experts today were SitePoint forum staff members Thom Parkin and Allan H, who did an amazing job of explaining a concept that most programmers find pretty sticky.

Here is a list of resources that came out of the session:

What is a Regular Expression?
How to create a RegEx
Syntax and parameters

And if you like puzzles… you may or may not like these…

A RegEx crossword
And another crossword
And another one

If you missed the session today because you didn’t know about it then make sure you sign up for email reminders of future sessions here.

And without further ado – a transcript of the session:

[23:00] <nandotinoco> Welcome to those people that have just joined. Thom Parkin (@ParkinT) is our expert today. He is a staff member of the SitePoint forums and is here to talk about Regular Expressions

[23:01] <ParkinT> AllanH is also a staff member of the SitePoint forums and will be an expert today.

[23:02] <ParkinT> This topic is just TOO BIG for one person.  

[23:03] <johnlacey> Where would you recommend a complete beginner with next to no experience with regular expressions start? lol

[23:03] <AllanH> There are different “flavors” of regex. we’d like to discuss Perl Compatible Regular Expressions

[23:03] <ParkinT> Regular Expressions are universal among most programming languages.  However, the implementation varies among the languages too

[23:03] <ParkinT> We would like to keep the discussion at a very broad and generic level.

[23:04] <ParkinT> Exactly, Allan.

[23:04] <AllanH> Apache mod_rewrite and PHP use PCRE, and JavaScript and of course Perl use closely related syntax

[23:04] <ParkinT> The purpose and intent of RegEx is to parse, match, find-and-replace characters and strings.

[23:04] <adams> Why must I learn regular expressions in programming?

[23:05] <ParkinT> Great question.

[23:05] <ParkinT> Actually, you are not REQUIRED to learn RegEX

[23:05] <ParkinT> You are not required to learn IF or Switch statements.

[23:05] <Jerry> How close are GAWK REs to PCRE?

[23:05] <ParkinT> It is just another tool that can (often) help you.

[23:05] <AllanH> There are good string functions but at times they are not powerful enough

[23:05] <johnlacey> It’s really about pattern recognition, isn’t it? I’ve seen regular expressions to check that email addresses match an expected format and also in .htaccess redirects…

[23:06] <ParkinT> That’s right.

[23:06] <AllanH> And not always so easy

[23:06] <AllanH> I’ve seen some that get what they want and are happy

[23:07] <ParkinT> I am not sure how close GAWK’s implementation is to PCRE.

[23:07] <AllanH> … until they also get what they want to NOT get

[23:07] <Jerry> So true, Allan

[23:08] <johnlacey> So could you give us an example of a (simple) regular expression?

[23:08] <ParkinT> That could be said about all software code, eh?

[23:08] <Jerry> Most times it’s easier to figure out the problem when you get too much than when you don’t get anything

[23:08] <ParkinT> JohnLacey asked an excellent question…

[23:08] <AllanH> True enough, I guess regex is part science and part art

[23:09] <ParkinT> Email validation is the “classic” use case for RegEx but I don’t think it is a very good example.

[23:09] <AllanH> I started with the PHP documentation

[23:09] <johnlacey> Because an email address can fit the prescribed format, but still not exist?

[23:10] <AllanH> Read it and still refer to it often

[23:10] <ParkinT> Parsing data to determine, for example, all the digits AFTER a decimal point might be an example of a “simple” RegEx.  Allan, do you agree?

[23:10] <AllanH> Yes, and something that might come up

[23:11] <ParkinT> Suppose I have this string:

[23:11] <ParkinT> 3.14159

[23:11] <ParkinT> Using RegEx you look for patterns, as johnlacey mentioned.

[23:11] <ParkinT> Allan, correct me where I mis-state anything…

[23:12] <ParkinT> The decimal point becomes the “anchor” in our evaluation.  We want to see what comes AFTER it.

[23:12] <AllanH> and can’t or don’t want to cast it as a float?

[23:13] <ParkinT> DRAT.  I cannot type slashes in this chat.

[23:13] <adams> /\

[23:13] <Jerry> /foo/

[23:13] <ParkinT> Are there control characters that I am not aware of?? I think I just turned off all the power to New York City!!

[23:13] <AllanH> If you knew how many numbers were always in front you could use string functions

[23:14] <ParkinT> “IF” you knew.  Right.

[23:14] <ParkinT> Suppose you don’t

[23:14] <Jerry> backslash before fwdslash

[23:14] <ParkinT> Thanks.  That will further complicate this!!!

[23:14] <AllanH> But for our sake we NEED to get that decimal!

[23:15] <ParkinT> /\d*[.](\d*)/

[23:15] <ParkinT> NO.  The preceding slashes appear too.

[23:15] <ParkinT> Here’s how I would approach it.  The backslash ‘d’ (\d) represents any ‘digit’ (Numeric)

[23:16] <ParkinT> We know there is an UNKNOWN number of digits BEFORE the decimal point.

[23:16] <ParkinT> backslash d followed by the star:  \d*

[23:16] <AllanH> isn’t “.” a “wildcard”?

[23:16] <ParkinT> Next is the decimal itself.  However, a dot is a command character in RegEx so we need to define it as EXPLICIT

[23:17] <ParkinT> Exactly, AllanH

[23:17] <ParkinT> But if you put characters like the dot in square brackets they are evaluated as literals

[23:17] <ParkinT> So [.] would represent the dot

[23:17] <AllanH> and only ONE dot

[23:18] <ParkinT> Next is the data we are trying to capture.  So we must surround it with parentheses () to represent a group.

[23:18] <ParkinT> and that data will ALSO be a set of digits with an unknown length (\d*)
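The pattern built up so far can be tried out in a few lines. This is only a sketch in Python, whose `re` module is not PCRE itself but treats `\d`, `[.]`, `*` and `()` the same way for this pattern; the function name `digits_after_point` is made up for illustration.

```python
import re

# The pattern from the chat: any digits, a literal dot, then a captured run of digits.
DECIMALS = re.compile(r"\d*[.](\d*)")

def digits_after_point(text):
    """Return the digits after the first decimal point, or None if there is no dot."""
    match = DECIMALS.search(text)
    # group(1) is whatever the parentheses captured
    return match.group(1) if match else None

print(digits_after_point("3.14159"))  # 14159
print(digits_after_point("42"))       # None
```

The decimal point acts as the anchor, and the parentheses decide which part of the match is handed back.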

[23:19] <ParkinT> But suppose we are not even sure there are ANY digits before the decimal?

[23:19] <AllanH> the “star” means zero or more

[23:19] <ParkinT> In that case this \d*[.](\d*) would not work

[23:19] <ParkinT> You are correct.  I was confusing the star and the plus sign;

[23:20] <ParkinT> which means one or more.  Bad example.  I should have used the + and then explained the star.   *embarrassed.

[23:20] <ParkinT> To better answer the original question, here are some ‘essentials’ of the Regular Expression.

[23:21] <ParkinT> As AllanH pointed out, the star means zero or more and refers to the set that preceded it.
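For reference, the three basic quantifiers behave as follows: `*` means zero or more, `+` means one or more, and `?` means zero or one. A small sketch in Python (an assumption of convenience; the sample strings are invented) shows how each one treats a number with no digits before or after the dot.

```python
import re

samples = ["3.14159", ".5", "7."]

# * : zero or more digits on each side, so ".5" and "7." still match
star = [bool(re.fullmatch(r"\d*[.]\d*", s)) for s in samples]

# + : at least one digit is required on each side
plus = [bool(re.fullmatch(r"\d+[.]\d+", s)) for s in samples]

# ? : at most one digit is allowed on each side
question = [bool(re.fullmatch(r"\d?[.]\d?", s)) for s in samples]

print(star)      # [True, True, True]
print(plus)      # [True, False, False]
print(question)  # [False, True, True]
```

`re.fullmatch` is used here so that the whole string has to fit the pattern, which makes the difference between the quantifiers visible.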

[23:21] <AllanH> I like the Mozilla Docs for Javascript reference

[23:21] <ParkinT> Do you have a link?

[23:22] <AllanH> https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions

[23:22] <ParkinT> That is great!  We can go home now!!

[23:22] <AllanH> https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp

[23:22] <AllanH> Maybe not, Docs aren’t the easiest thing to digest

[23:23] <AllanH> eg. http://www.pcre.org/pcre.txt

[23:23] <ParkinT> RegEx is hard for most people because it can be very intimidating.

[23:24] <johnlacey> I know I’m only on my second coffee of the day, and my brain is exploding a little just reading the Mozilla documentation. lol

[23:24] <ParkinT> Like anything, if you approach it gently – one bite at a time – and practice in small doses…

[23:24] <ParkinT> The syntax is weird and the choice of characters makes it very confusing to read.

[23:24] <AllanH> I have only ever learned, and still do learn, on a “need to know” basis.

[23:25] <johnlacey> Do regular expressions vary between languages, or are they pretty universal?

[23:25] <ParkinT> There are many tools (online and desktop) that will evaluate RegEx.  They let you “Poke and try” different patterns

[23:25] <ParkinT> As I said earlier “Regular Expressions are universal among most programming languages.  However, the implementation varies among the languages too “

[23:26] <ParkinT> There are general syntax rules that do not vary among languages.

[23:26] <AllanH> I think once you get the basic syntax down they’re pretty much alike, at least enough so that you can figure out how to do what you need to by referring to the Docs

[23:26] <ParkinT> Ruby, for example, will recognize RegEx in almost anyplace a string could be used.

[23:27] <AllanH> and if it ain’t binary it’s text

[23:30] <AllanH> I think a lot of the “tools” eg. match, replace, split – are similar across languages too
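As a rough illustration of those three tools, here is how match, replace and split look in Python. Other languages name them differently (for example `preg_match`, `preg_replace` and `preg_split` in PHP), and the sample line below is invented.

```python
import re

line = "2014-03-27  regex,  chat;transcript"

# match: test for a pattern at the start of the string and pull out what matched
date = re.match(r"\d{4}-\d{2}-\d{2}", line).group(0)

# replace: collapse every run of whitespace into a single space
cleaned = re.sub(r"\s+", " ", line)

# split: break the string on a comma or semicolon plus any trailing spaces
parts = re.split(r"[,;]\s*", cleaned)

print(date)     # 2014-03-27
print(cleaned)  # 2014-03-27 regex, chat;transcript
print(parts)    # ['2014-03-27 regex', 'chat', 'transcript']
```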

[23:30] <nandotinoco> A lot of web developers first get exposed to RegEx when fixing a bug on an .htaccess file or trying to figure out URL redirections. Do you have any tips or a trick for those cases?

[23:30] <ParkinT> Absolutely!!

[23:31] <ParkinT> My first response to that question, nandotinoco, is “StackOverflow” !!

[23:31] <ParkinT> tongue-in-cheek

[23:31] <AllanH> You could try asking in the http://www.sitepoint.com/forums/forumdisplay.php?97-Server-Configuration-Apache-amp-URL-Rewriting forum

[23:32] <nandotinoco> That’s better ;-)

[23:32] <ParkinT> Those rewrites seem to be a beast of their own.

[23:32] <johnlacey> I’ve seen examples where they check for domain.com/directory and change it to domain.com/directory/

[23:32] <AllanH> Apache has things like “flags” that can get tricky at times, but syntax is similar

[23:33] <AllanH> Yes, gotta love “friendly URLs”

[23:33] <ParkinT> That’s right.  By capturing groups and then reapplying what was captured you can completely rearrange things
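The capture-and-reapply idea can be sketched outside Apache as well. The Python below is not mod_rewrite syntax, and both the function names and the URL layouts are hypothetical; it only shows how a captured group is referenced again in the replacement with `\1`, `\2` and so on.

```python
import re

def add_trailing_slash(path):
    """Append a slash to a directory-style path such as /directory."""
    # The last segment must contain no dot, so file names like /style.css are left alone.
    return re.sub(r"^(/(?:[^/]+/)*[^/.]+)$", r"\1/", path)

def rearrange(path):
    """Turn a made-up /year/month/slug URL into /slug/year-month."""
    return re.sub(r"^/(\d{4})/(\d{2})/([^/]+)$", r"/\3/\1-\2", path)

print(add_trailing_slash("/directory"))   # /directory/
print(add_trailing_slash("/directory/"))  # /directory/
print(add_trailing_slash("/style.css"))   # /style.css
print(rearrange("/2014/03/regex"))        # /regex/2014-03
```

Paths that do not match the pattern are returned unchanged, which is usually what you want from a rewrite rule.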

[23:34] <AllanH> and being able to redirect an HTTP request from an old page to the new page

[23:35] <ParkinT> If you want to sharpen your RegEx skills there are numerous puzzles and crosswords online that use Regular Expressions.

[23:36] <johnlacey> Could you like us to one of those puzzles? Sounds like fun (but also a challenge).

[23:36] <johnlacey> link us*

[23:37] <ParkinT> Searching…

[23:37] <AllanH> How much performance difference do you think there is between using [a-zA-Z] …. [a-z] /i and [\w] ?

[23:38] <AllanH> my feeling is use what’s easiest to read when starting out, then work in the more elegant as you progress

[23:38] <ParkinT> That’s a question that is hard to answer, probably varies among languages and – I bet – would be very slight.

[23:39] <ParkinT> Here is one that I admit I have not been able to complete:  http://www.coinheist.com/rubik/a_regular_crossword/grid.pdf

[23:39] <johnlacey> Thanks ParkinT

[23:39] <ParkinT> But, better for beginning, I just found this in a Google search: http://regexcrossword.com/

[23:40] <AllanH> a line that’s say 30 characters long but readable vs. the same effect from one that’s 8 characters long but needs to be mentally “translated”

[23:40] <ParkinT> And, this one looks interesting… http://www.regexcrosswords.com/

[23:40] <ParkinT> I agree, AllanH.

[23:41] <ParkinT> Developers tend to favor ‘elegance’ and ‘cleverness’ a bit too much.

[23:41] <ParkinT> I am quite guilty as charged.

[23:41] <ParkinT> Concise is a good thing to strive for.  But readability is important because MAINTAINING code is critical (and very expensive).

[23:41] <AllanH> and as you say, in terms of performance, negligible difference

[23:42] <AllanH> but we DO like to show off ;)

[23:42] <ParkinT> If another developer (or even the future you) has difficulty deciphering the intent of an expression..

[23:42] <ParkinT> that translates into time which is money.

[23:42] <ParkinT> LOL  ABSOLUTELY.

[23:42] <johnlacey> I completely agree – readability is so important.

[23:43] <AllanH> lol add a comment that’s longer than the verbose code

[23:43] <ParkinT> Perhaps we should take a lesson from those puzzles on line (pun intended) and build a Regular Expression course on Learnables.

[23:44] <grrowl> irt \w compared to [a-z], \w is actually slower because it matches a LOT more than just a-z, including many other languages’ “word” characters

[23:44] <AllanH> @ParkinT one for the MC?

[23:46] <AllanH> true indeed a “word” to Perl is not always an English word

[23:47] <AllanH> eg. my_function

[23:47] <ParkinT> At the same time, “what’s a few milliseconds among friends?”

[23:48] <grrowl> yes, the performance difference is very small… unless you’re specifically optimising that case, always go for the most readable code
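Performance aside, the three spellings do not match the same text. The sketch below uses Python 3, where `\w` is Unicode-aware by default; that default differs between engines, so treat the output as specific to this assumption. It does not measure speed.

```python
import re

# \u00e9 is the accented e in "cafe", written as an escape to keep the source ASCII
text = "my_function caf\u00e9 42"

# Explicit ASCII letters only
ascii_letters = re.findall(r"[a-zA-Z]+", text)

# Lower-case range plus the case-insensitive flag (the /i in the chat)
ignore_case = re.findall(r"[a-z]+", text, flags=re.IGNORECASE)

# \w also accepts digits, the underscore and non-English letters
word_chars = re.findall(r"\w+", text)

print(ascii_letters)  # ['my', 'function', 'caf']
print(ignore_case)    # ['my', 'function', 'caf']
print(word_chars)     # ['my_function', 'café', '42']
```

So the choice between them is less about milliseconds and more about whether underscores, digits and accented letters should count as part of a word.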

[23:49] <AllanH> So I wonder what I would consider to be the basic essential things to “get” first. escape character comes to mind ;)

[23:50] <AllanH> and ^ start and $ end

[23:50] <ParkinT> In my experience the ‘basics’ are those things you use most often.

[23:50] <ParkinT> Yes.  Start and end.  The quantity ? * + {.}

[23:50] <ParkinT> And (what I call) the shortcuts:  \w \W \s \S 

[23:51] <ParkinT> \d

[23:51] <ParkinT> and the NOT  ^
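A few of these essentials in action, as a minimal Python sketch with invented sample strings: `^` and `$` pin a pattern to the start and end, `^` inside square brackets negates the class, and `\S` is the opposite of `\s`.

```python
import re

# Anchors: the whole string must be digits, not just some part of it
whole_is_number = bool(re.search(r"^\d+$", "2014"))
part_is_number = bool(re.search(r"^\d+$", "year 2014"))
unanchored = bool(re.search(r"\d+", "year 2014"))

# Negation: inside brackets a leading ^ means "anything except"
digits_only = re.sub(r"[^0-9]", "", "(555) 010-9999")

# Shortcuts: \S+ grabs runs of non-whitespace characters
words = re.findall(r"\S+", "  two  words ")

print(whole_is_number, part_is_number, unanchored)  # True False True
print(digits_only)                                  # 5550109999
print(words)                                        # ['two', 'words']
```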

[23:51] <AllanH> I use quantifiers all the time

[23:51] <AllanH> and character classes

[23:51] <ParkinT> It is important because MOST RegEx implementations are very greedy

[23:52] <ParkinT> Without the quantifiers you could match far beyond the point you intended.

[23:53] <AllanH> true how many times have I seen a thread where the OP wanted a single a tag but was getting the first a tag to the last
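The situation AllanH describes is easy to reproduce. In this Python sketch the HTML snippet is invented, and the fix shown, a lazy `*?` quantifier, is one common way to rein in a greedy match; it is not a substitute for a real HTML parser.

```python
import re

html = '<a href="/one">One</a> and <a href="/two">Two</a>'

# Greedy: .* runs as far as it can, so the match spans the first <a to the last </a>
greedy = re.findall(r"<a\b.*</a>", html)

# Lazy: .*? stops at the first </a> it can reach, giving one match per tag
lazy = re.findall(r"<a\b.*?</a>", html)

print(greedy)  # ['<a href="/one">One</a> and <a href="/two">Two</a>']
print(lazy)    # ['<a href="/one">One</a>', '<a href="/two">Two</a>']
```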

[23:53] <ParkinT> To follow up on an earlier comment, I have found this to be very, very instructive:  http://regexcrossword.com/challenges/tutorial/puzzles/1

[23:54] <ParkinT> Click on the HELP in the top navigation area

[23:56] <ParkinT> The history of Regular Expressions is very interesting.  It began before computers in any form like we know them today.

[23:57] <ParkinT> According to Wikipedia (http://en.wikipedia.org/wiki/Regular_expression) around 1950.  I would venture to guess NONE of us here were around then.  And *I* am pretty old !!

[23:58] <ParkinT> Thanks to all of you for taking time to participate.

[23:58] <ParkinT> SitePoint and Learnables represent an incredibly rich resource for modern web developers.

[23:59] <nandotinoco> Yes, unless anyone wants to ask a final question we should wrap the discussion up here.

[23:59] <ParkinT> If there is something about which you are passionate or feel very comfortable talking about, let us know.

[23:59] <ParkinT> An ‘expert’ is often only the one who is willing to talk about it out loud.

[23:59] <nandotinoco> Thanks so much for your time AllanH and ParkinT and for sharing some of your knowledge

[0:00] <AllanH> I wanted to add that regex questions can be asked in other forums too

[0:00] <ParkinT> Sitepoint forums!!

[0:00] <AllanH> http://www.sitepoint.com/forums/forumdisplay.php?34-PHP

[0:00] <AllanH> http://www.sitepoint.com/forums/forumdisplay.php?15-JavaScript-amp-jQuery

[0:00] <AllanH> http://www.sitepoint.com/forums/forumdisplay.php?36-Perl-amp-Python

[0:01] <nandotinoco> For sure. The forums are always there as a great resource. Thanks to everyone else for joining us. Next week we’re talking SASS

[0:01] <ParkinT> Next week we GET SASSY

[0:02] <AllanH> You’re most welcome nandotinoco, Thanks all
