By Harry Fuecks

The Joy of Regular Expressions [3]

Following on from the last part, this one is more of an intermission – a round-up of the regex syntax seen so far, plus a couple of links prompted by feedback.

Part 4 is here.

Reads

First you have to check out Andrei’s Regex Clinic (slides / pdf) – even if you don’t get it all, it’s worth it for the pictures at the start ;) That’s Andrei as in Zmievski, as in works at Yahoo, can be blamed for Smarty and PHP-GTK, is one of those who needs thanking when PHP6 (with Unicode) hits the streets and, long ago, even did an interview with Sitepoint.

Andrei’s talk also prompts me to a confession: I’m not qualified to tell you about the theory behind regular expressions (if you’re interested, start here and Google for more – or annoy these guys). I’m coming from a practical perspective, so while these blogs will (hopefully) help you discover regexes as a useful tool, don’t expect to find out how to write your own regex engine.

Another read, specific to escaping regular expressions and the types of security holes you might fall into with preg_replace(), is Jeff’s explanation of two preg_replace() escaping gotchas, which describes the exact nature of the problem and provides a solution for escaping replacement strings.
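To make the replacement-string gotcha concrete, here is a minimal sketch. It is not necessarily the solution Jeff describes; the helper name escape_replacement() is made up for illustration, and the approach assumes that escaping backslashes and dollar signs with addcslashes() is enough for your case.

```php
<?php
// Hypothetical helper (not taken from Jeff's article): make a string safe to
// use as a literal preg_replace() replacement by escaping "\" and "$".
function escape_replacement(string $text): string
{
    return addcslashes($text, '\\$');
}

$pattern     = '/PRICE/';
$replacement = '$10 \1';   // imagine this arrived from user input

// Unescaped: "$10" and "\1" are treated as backreferences, not as text.
$naive = preg_replace($pattern, $replacement, 'cost: PRICE');

// Escaped: the replacement is inserted character for character.
$safe = preg_replace($pattern, escape_replacement($replacement), 'cost: PRICE');

echo $safe, "\n";
```

The point is that preg_quote() protects the pattern, not the replacement; the replacement string has its own special characters and needs its own escaping.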

Cumulative Syntax Cheat Sheet

So – a refresher packed with “regexy” terminology.

  • Expression delimiters: – mark the start and end of the pattern itself – discussed here. They also matter for preg_quote(), whose second argument names the delimiter to escape – covered here.
  • Literals: match real characters “one-to-one” e.g. alpha-numeric characters: pattern /a/ matches “a” – example here.
  • Pattern Modifiers: – change the global behavior of the regex engine, and are placed after the second expression delimiter. Pattern modifiers you’ve seen so far are;
    • /i – “case insensitive” matching – literal characters from the alphabet, used in the pattern, will match lower or upper case in the string being searched – example here.
      Note: what I haven’t mentioned so far is that this behavior can also be affected by your server’s locale settings, similar to the issue with \w – see the “Detail Overload!” note here
    • /e – the “evil eval” of preg_replace() – see here but note this comment re: “eval is evil”.
    • /x – “extended mode”, allowing whitespace and comments in your pattern – gets a brief mention here.
  • Character classes: match in terms of “many-to-one” i.e. a character class in a pattern could match one of a list of characters in the string being searched. Character classes you’ve seen so far are the PCRE “built-ins”;
    • \w – a “word character” – basically any letter of the alphabet, any digit or the underscore – discussed here
    • \W – everything else that’s not a word character (white space, punctuation, etc) – also discussed here
    • . (period) – means “anything” – matches any character (but warning: linefeed characters are a special case I haven’t covered yet) – example here
    • …or DIY e.g. [a-zA-Z0-9_] – example here
  • Quantifiers: apply to the preceding character (or meta character) to change the number of times it matches (they affect the length of the match which is made). Quantifiers you’ve seen so far are;
    • + – meaning “one or more” – example here
    • ? – meaning “zero or one” – example here
    • * – meaning “zero or more” – also used here
    • Roll your own with curly brackets e.g. {5,20} – see here
  • Assertions: impose a condition which needs to be met for a match to be made, but do not become part of the match themselves. Assertion meta-characters you’ve already seen are;
    • ^ – asserts the “start of a line” or the beginning of the string being searched – discussed here.
    • $ – asserts the “end of a line” or the end of the string being searched – also covered here.
    • \b – asserts a “word boundary” – the point where one or more characters matched by \w (word character class) meets one or more characters matched by \W (non-word character class), or the start or end of the string – covered here.
  • Sub patterns – allow you to group parts of the pattern, allowing you to do stuff like “capture” them for preg_replace() and other nice things we haven’t seen yet – discussed here.
  • Backslash – the escape character discussed here (plus see discussion of preg_quote())
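As a quick way to try the cheat sheet out, the sketch below runs one small example per entry. The sample strings and variable names are invented for illustration, and the /e modifier is left out on purpose; the value noted beside each line is what the call should return.

```php
<?php
// Delimiters and literals: the slashes delimit the pattern, "a" matches itself.
$literal  = preg_match('/a/', 'cat');                       // 1

// Pattern modifier /i: case-insensitive matching.
$caseless = preg_match('/php/i', 'I like PHP');             // 1

// Pattern modifier /x: whitespace and # comments inside the pattern are ignored.
$extended = preg_match('/
    [0-9]{4}   # four digits
    -          # a literal hyphen
    [0-9]{2}   # two digits
/x', 'Released 2007-03');                                   // 1

// Character classes: \w, \W, the dot, and a hand-rolled class.
preg_match_all('/\w+/', 'foo_bar, baz!', $words);           // $words[0] is ['foo_bar', 'baz']
$stripped = preg_replace('/\W/', '', 'a-b c!');             // 'abc'
$dot      = preg_match('/c.t/', 'cut');                     // 1
$diy      = preg_match('/^[a-zA-Z0-9_]{5,20}$/', 'harry_f'); // 1 (7 characters)

// Quantifiers: +, ? and *.
$plus     = preg_match('/ab+c/', 'ac');                     // 0, + needs at least one "b"
$optional = preg_match('/colou?r/', 'color');               // 1
$star     = preg_match('/ab*c/', 'ac');                     // 1

// Assertions: ^, $ and \b constrain where a match may happen.
$start      = preg_match('/^cat/', 'catalog');              // 1
$end        = preg_match('/cat$/', 'catalog');              // 0
$boundary   = preg_match('/\bcat\b/', 'a cat sat');         // 1
$noBoundary = preg_match('/\bcat\b/', 'catalog');           // 0

// Sub patterns: capture two words and swap them in the replacement.
$swapped = preg_replace('/(\w+) (\w+)/', '$2 $1', 'hello world'); // 'world hello'

// Backslash and preg_quote(): the second argument escapes the delimiter too.
$needle       = 'a.b/c';
$quotedMatch  = preg_match('/^' . preg_quote($needle, '/') . '$/', 'a.b/c'); // 1
$quotedNoMiss = preg_match('/^' . preg_quote($needle, '/') . '$/', 'axb/c'); // 0
```

Note that every pattern sits in a single-quoted PHP string, so sequences like \w and \b reach the regex engine untouched.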

That is all.
