The Joy of Regular Expressions [3]

Following on from the last part, this one is more of an intermission – a round-up of the regex syntax seen so far, plus a couple of links prompted by feedback.

Part 4 is here.

Reads

First you have to check out Andrei’s Regex Clinic (slides / pdf) – even if you don’t get it all, it’s worth it for the pictures at the start ;) That’s Andrei as in Zmievski, as in works at Yahoo, can be blamed for Smarty and PHP-GTK, is one of those who needs thanking when PHP6 (with Unicode) hits the streets and, long ago, even did an interview with Sitepoint.

Andrei’s talk also prompts me to make a confession: I’m not qualified to tell you about the theory behind regular expressions (if you’re interested, start here and Google for more – or annoy these guys). I’m coming from a practical perspective, so while these blogs will (hopefully) help you discover regexes as a useful tool, don’t expect to find out how to write your own regex engine.

Another read, specific to escaping regular expressions and the kinds of security holes you might fall into with preg_replace(), is Jeff’s explanation of two preg_replace() escaping gotchas, which describes the exact nature of the problem and provides a solution for escaping replacement strings.
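To make the two kinds of escaping concrete, here is a minimal sketch; it is not a reproduction of the linked write-up. It assumes the untrusted text arrives in plain variables ($userInput and $replacement are names made up for this example) and uses addcslashes() as one common way to stop preg_replace() from reading backreferences in a replacement string.

```php
<?php
// Hypothetical inputs standing in for text that came from a user.
$userInput   = 'a/b.c';          // should be matched literally
$replacement = '$1 costs \5';    // should be inserted literally

// Pattern side: preg_quote() escapes regex meta-characters. The second
// argument names the delimiter so that it gets escaped as well.
$quoted = preg_quote($userInput, '/');

// Replacement side: preg_replace() treats $n and \n as backreferences,
// so put a backslash in front of every "$" and "\" to keep them literal.
$safeReplacement = addcslashes($replacement, '\\$');

$matches = preg_match('/^' . $quoted . '$/', 'a/b.c');   // the literal text matches
$noMatch = preg_match('/^' . $quoted . '$/', 'a/bxc');   // "." no longer means "anything"
$result  = preg_replace('/PRICE/', $safeReplacement, 'Item: PRICE');
```

With the escaping in place, the "$1" and "\5" in the replacement should come out as ordinary text instead of being swapped for (empty) captured groups.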

Cumulative Syntax Cheat Sheet

So – a refresher packed with “regexy” terminology.

  • Expression delimiters: – mark the start and end of the pattern itself – discussed here. Note their relevance to the second argument of preg_quote() – covered here.
  • Literals: match real characters “one-to-one” e.g. alpha-numeric characters: pattern /a/ matches “a” – example here.
  • Pattern Modifiers: – change the global behavior of the regex engine, and are placed after the second expression delimiter. Pattern modifiers you’ve seen so far are;
    • /i – “case insensitive” matching – literal characters from the alphabet, used in the pattern, will match lower or upper case in the string being searched – example here.
      Note: what I haven’t mentioned so far is that this behavior can also be affected by your server’s locale settings, similar to the issue with \w – see the “Detail Overload!” note here
    • /e – the “evil eval” of preg_replace() – see here but note this comment re: “eval is evil”.
    • /x – “extended mode”, allowing comments in your pattern – gets a brief mention here.
  • Character classes: match in terms of “many-to-one” i.e. a character class in a pattern could match one of a list of characters in the string being searched. Character classes you’ve seen so far are the PCRE “built-ins”;
    • \w – a “word character” – basically any letter of the alphabet, a digit or the underscore – discussed here
    • \W – everything else that’s not a word character (white space, punctuation, etc) – also discussed here
    • . (period) – means “anything” – matches any character (but warning: linefeed characters are a special case I haven’t covered yet) – example here
    • …or DIY e.g. [a-zA-Z0-9_] – example here
  • Quantifiers: apply to the preceding character (or meta character) to change the number of times it matches (they affect the length of the match which is made). Quantifiers you’ve seen so far are;
    • + – meaning “one or more” – example here
    • ? – meaning “zero or one” – example here
    • * – meaning “zero or more” – also used here
    • Roll your own with curly brackets e.g. {5,20} – see here
  • Assertions: impose a condition which needs to be met for a match to be made, but do not become part of the match themselves. Assertion meta-characters you’ve already seen are;
    • ^ – asserts the “start of a line” or the beginning of the string being searched – discussed here.
    • $ – asserts the “end of a line” or the end of the string being searched – also covered here.
    • \b – asserts a “word boundary” – the point where one or more characters matched by \w (word character class) meets one or more characters matched by \W (non-word character class) – covered here.
  • Sub patterns – allow you to group parts of the pattern, letting you do stuff like “capture” them for preg_replace() and other nice things we haven’t seen yet – discussed here.
  • Backslash – the escape character discussed here (plus see discussion of preg_quote())
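As a refresher in runnable form, the sketch below exercises most of the items in the list above. The sample strings and variable names are invented for illustration. /e is deliberately left out: it was deprecated in PHP 5.5 and removed in PHP 7.0 in favour of preg_replace_callback().

```php
<?php
// Delimiters and literals: the slashes mark the pattern, "a" matches itself.
$literal = preg_match('/a/', 'cat');

// Pattern modifier /i: case-insensitive matching.
$caseInsensitive = preg_match('/php/i', 'I like PHP');

// Pattern modifier /x: whitespace and "#" comments in the pattern are ignored.
$extended = preg_match('/
    [0-9]{4}   # four digits for the year
    -          # a literal hyphen
    [0-9]{2}   # two digits for the month
/x', 'Released 2006-09');

// Character classes: \w collects word characters, \W everything else.
preg_match_all('/\w+/', 'Hello, regex world!', $words);
$pieces = preg_split('/\W+/', 'a, b; c');

// The period matches any single character (linefeeds aside).
$anyChar = preg_match('/c.t/', 'cut');

// A DIY class with a curly bracket quantifier: 5 to 20 allowed characters.
$username = preg_match('/^[a-zA-Z0-9_]{5,20}$/', 'harry_f');

// Quantifiers: ? is zero or one, + is one or more, * is zero or more.
$optional   = preg_match('/colou?r/', 'color');
$oneOrMore  = preg_match('/go+gle/', 'ggle');
$zeroOrMore = preg_match('/go*gle/', 'ggle');

// Assertions: ^ and $ anchor the match, \b asserts a word boundary.
$startsWith = preg_match('/^the/', 'the cat sat');
$endsWith   = preg_match('/sat$/', 'the cat sat');
$wholeWord  = preg_match('/\bcat\b/', 'the cat sat');
$insideWord = preg_match('/\bcat\b/', 'concatenate');

// Sub patterns: capture two words and swap them in the replacement.
$swapped = preg_replace('/(\w+) (\w+)/', '$2 $1', 'hello world');

// Backslash: an escaped period only matches a real period.
$escapedDot   = preg_match('/3\.14/', '3x14');
$unescapedDot = preg_match('/3.14/', '3x14');
```

preg_match() returns 1 for a match and 0 for no match, so dumping any of these variables with var_dump() shows whether that particular pattern hit.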

That is all.

  • http://diigital.com cranial-bore

    These blog posts have been useful and definitely something I’d like to keep as reference material.
    Unfortunately blog posts often disappear in the sands of time. Would you consider publishing a PDF version including parts 1-N for download?

  • http://www.sitepoint.com AlexW

    Guys, we’re agreed that Harry’s done a superb job on the RegEx stuff and we’re looking to polish and republish it as a full feature article, perhaps even with a PDF download option.

  • http://en.journey.bg/portal.html 1magic

    Very useful! Saved in my Favorites for ever.

  • Anonymous

    Nice articles.

    Too bad regex sucks monkey balls.

    Well, it would if there was a decent competitor. Too bad everything that could compete are focused on compiler parser generation.

  • http://www.regex.fr PaulArdemue

    Great summary !

  • lorenw

    I have enjoyed these articles, Thanks, regex has saved me many a time.

    If you ever need a pdf version of any web page go here.
    http://www.pdfforge.org/

    It installs a tool bar in ie that allows you to convert any web page to a pdf (from “word” select pdfcreator as a new printer when you go to print to create pdfs that way)

    There, no more waiting for the pdf version. I have my pdf version of these discussions, Many Thanks.

  • Anonymous

    Oh yea, regarding my last post, from firefox go to “print page” and select pdfCreator from the printer list.

  • Anonymous

    thx