The Joy of Regular Expressions [4]

Having found some more joy, time to interrupt your Friday evening viewing, picking up the saga from where we left off last time.

Key Takeaways

Regular expressions can efficiently parse and extract parts of strings, such as breaking down a timestamp into its constituent year, month, day, hour, minute, and second elements.
The PCRE Extended Pattern Modifier allows for more readable and manageable regex patterns by ignoring whitespace and enabling inline comments, enhancing pattern clarity and debugging.
Non-capturing subpatterns in regular expressions optimize memory usage and processing speed by not storing the matched content, useful for checking patterns without needing the actual data.
Branching with the vertical bar operator in regex provides flexibility in pattern matching, allowing for alternative patterns that can match different possible inputs within the same regex.
The `preg_split` function extends the capabilities of string splitting by using regex patterns as delimiters, supporting complex splitting criteria and improving data manipulation and extraction.

Is that a date?
- The \d meta character
- More sub patterns
User friendlier dates
Exploding with Patterns
- The White space Meta character
Capturing Split Delimiters

Is that a date?

You’ve already had your first taste of sub patterns here, where they were used to capture a word and wrap it in an HTML span tag via preg_replace_callback(). It’s time to explore sub patterns a little further…

You’ve got a string containing a date / time stamp like ‘20061028134534’ – that’s year (4 digits), month (2 digits), day of the month (2 digits), hour (2 digits, 24 hour clock), minutes and seconds (both 2 digits). You need to break it up into its constituent parts so you can use them for calculations.

Now you could use multiple calls to substr() but an alternative solution is a regular expression, for example;


<?php
$date = '20061028134534'; # The input date string

preg_match(
            '/^(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})$/',
            $date,
            $matches
          );

print_r($matches);

Looking at the pattern in detail;

/^(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})$/

At the start and end of the pattern are the ^ and $ assertions you’ve already seen, so no problem there.

The \d meta character

The \d is another meta-character-class, which matches “any decimal digit”. It’s similar to the \w meta-character you saw here, but for numbers instead of word characters. In fact it’s shorthand for writing your own character class like [0-9] (which you’ve seen before here).

Attached to every occurrence of \d is a length quantifier such as \d{4} meaning exactly four digits (for matching the year – 2006) or \d{2} – exactly two digits.
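As a quick sanity check of that equivalence, here is a minimal sketch that runs the shorthand and the hand-written character class side by side. The function name four_digits() and the sample inputs are made up for illustration, and it assumes plain ASCII input with no /u modifier in play.

```php
<?php
// Hypothetical helper: compares \d{4} with the hand-written class [0-9]{4}.
function four_digits(string $s): array
{
    return [
        'shorthand' => preg_match('/^\d{4}$/', $s),
        'explicit'  => preg_match('/^[0-9]{4}$/', $s),
    ];
}

// Illustrative inputs: only the first one is exactly four digits.
foreach (['2006', '206', '20061', 'abcd'] as $s) {
    $r = four_digits($s);
    printf("%-5s shorthand=%d explicit=%d\n", $s, $r['shorthand'], $r['explicit']);
}
```

Both columns should agree on every line, and only ‘2006’ should report a match.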

More Sub Patterns

So far so good but what is the role of all the sub patterns here? They tell the PCRE engine that I want to capture each sub-match, which PHP will then make available via the third argument to preg_match() – in this example the $matches variable – which is an array populated by reference – more on that in a moment.

The above code outputs the following (the contents of the $matches variable);

Array
(
    [0] => 20061028134534
    [1] => 2006
    [2] => 10
    [3] => 28
    [4] => 13
    [5] => 45
    [6] => 34
)

The first ([0]) element of the array is the complete match which, in this case, happens to be the same as the complete input string. Meanwhile the elements indexed [1] to [6] are the components of the date from year down to seconds – they were captured by the sub patterns.

You’ve already seen that preg_match() returns the number of matches it made (which will be either 0 or 1) but its third argument is an array that acts as a medium for returning the values which were actually matched. Instead of the normal way you get a result back from a function like;


$result = myfunc();

…you give preg_match() a variable name as a function argument and it fills it with values for you – something like;


$result = somefunc($more_results_get_put_here_by_reference);

The first element of this array will always contain the complete match (assuming there was one) across the entire pattern while further array indexes correspond to values captured by sub patterns, the order of elements being determined by the relative position of the opening parenthesis ‘(‘ of the sub pattern, when reading the overall pattern from left to right; and note that applies even when you have sub patterns nested inside sub patterns.
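To make the ordering rule concrete, here is a small sketch with a nested sub pattern. The pattern and input are invented for illustration; the point is only that indexes follow the position of each opening parenthesis.

```php
<?php
// The outer group opens first, so it becomes [1];
// the two groups nested inside it become [2] and [3].
preg_match('/^((\d{4})(\d{2}))(\d{2})$/', '20061028', $m);

print_r($m);
// [0] => 20061028  (complete match)
// [1] => 200610    (outer group: year and month together)
// [2] => 2006
// [3] => 10
// [4] => 28
```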

So now we’ve turned the date / time stamp into a useful array we can perform calculations with, as well as validating its format (but not the actual values of course – 31st Feb 2006 would pass!).
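If you also want to reject impossible values, one option is to hand the captured parts to PHP’s built-in checkdate(). The following is only a sketch: the function name parse_timestamp() and the shape of the returned array are assumptions made for this example, not part of the original script.

```php
<?php
// Hypothetical helper: format check via regex, value check via checkdate().
function parse_timestamp(string $stamp)
{
    if (!preg_match('/^(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})$/', $stamp, $m)) {
        return false; // wrong format
    }

    // Drop the complete match in [0] and cast the six components to integers.
    [$year, $month, $day, $hour, $min, $sec] = array_map('intval', array_slice($m, 1));

    // checkdate() takes month, day, year (in that order).
    if (!checkdate($month, $day, $year) || $hour > 23 || $min > 59 || $sec > 59) {
        return false; // right format, impossible value
    }

    return compact('year', 'month', 'day', 'hour', 'min', 'sec');
}

var_dump(parse_timestamp('20061028134534')); // array of six integers
var_dump(parse_timestamp('20060231120000')); // bool(false): no 31st Feb
```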

There is another and (perhaps) more elegant approach to processing dates in this format, which you’ll see later on when looking at preg_split().

User friendlier dates

The above time stamp is handy for log files and similar but it’s not the easiest to read. We tend to offer end users dates in a format like 28th Oct 2006. So how about a regular expression to validate the format (but not the values!) and extract the interesting parts?


<?php
$date = '28th Oct 2006';

preg_match(
                '/^
                 
                 (\d{1,2})  # Match the day of the month
                 
                 (?:st|nd|rd|th) # Match English ordinal suffix
                 
                 \x20            # Match space character
                 
                 # Match the month....
                 (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)
                 
                 \x20            # Match another space character
                 
                 (\d{4})         # Match the year
                 
                $/x',
                $date,
                $matches
                );

    
print_r($matches);

The PCRE Extended Pattern Modifier

First thing that may strike you – the regex is full of white space and comments – that’s because I’m using the /x pattern modifier I mentioned before here. From the manual…

If this modifier is set, white space data characters in the pattern are totally ignored except when escaped or inside a character class, and characters between an unescaped # outside a character class and the next newline character, inclusive, are also ignored. This is equivalent to Perl’s /x modifier, and makes it possible to include comments inside complicated patterns. Note, however, that this applies only to data characters. White space characters may never appear within special character sequences in a pattern, for example within the sequence (?( which introduces a conditional sub pattern.

…so it allowed me to insert some comments to help explain what the regular expression is doing. That said, it also helps to be able to see the complete expression on a single line…

/^(\d{1,2})(?:st|nd|rd|th)\x20(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\x20(\d{4})$/

So, examining this pattern piece by piece…

At the start we have the usual “start of subject” assertion ^ followed by a sub pattern (\d{1,2}); this matches the digits for the day of the month – it could be the first of the month (only one digit) or in the case of the 28th, it’s two digits, hence the length quantifier {1,2}.

Non Capturing Sub patterns

The next part of the expression introduces two new features at once: (?:st|nd|rd|th). At first glance that may also look like a sub pattern but looking more closely: (?: changes its meaning to “non-capturing”. You can think of it as kind of a “custom assertion”, if that helps. Just like the ^ and $ assertions you saw here and the word boundary assertion \b discussed here, it asserts a condition which must be met but does not become a captured sub pattern (although, unlike a true assertion, it still consumes the characters it matches).

Put another way, a sub pattern of the form (?: ) means “this is a non-capturing sub pattern; whatever it matches should not be returned in the results”.

So why don’t I want to capture my (?:st|nd|rd|th) sub pattern? It’s intended to match the English ordinal suffix to a number e.g. 1st, 2nd, 3rd or 4th. I’ve decided this should be included for a valid date format but I’m not actually interested in the value itself, so no need to capture it.

Non-capturing sub patterns are also worth being aware of with respect to memory and processing overhead, as Andrei mentions in this Regex Clinic (pdf, page 99/100) – the PCRE engine doesn’t need to assign memory for their contents. In this example the impact will be insignificant but for larger documents / more complex patterns, performance and memory overhead can become critical.
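To see the difference in the results, here is a small side-by-side sketch. The variable names are made up for illustration, and the patterns are cut down to just the day and its suffix.

```php
<?php
$date = '28th Oct 2006';

// Plain parentheses: the suffix is captured and shows up as element [2].
preg_match('/^(\d{1,2})(st|nd|rd|th) /', $date, $capturing);

// (?: ... ): the suffix must still be present, but it is not returned.
preg_match('/^(\d{1,2})(?:st|nd|rd|th) /', $date, $nonCapturing);

print_r($capturing);    // [0] => "28th ", [1] => 28, [2] => th
print_r($nonCapturing); // [0] => "28th ", [1] => 28
```

Both patterns accept and reject exactly the same input; only the contents of the results array change.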

Branching

So now we’ve got the non-capturing sub pattern covered, the other new arrival in (?:st|nd|rd|th) is the “vertical bar” character |. This is the “branch operator” and allows you to specify alternative patterns that could be matched.

It’s something like the “or” operator in PHP – it allows you to set up alternative conditions, one of which could be met. What I’m doing here is asserting that the digit day of the month must be followed by any one of the strings “st”, “nd”, “rd” or “th” – that covers the English ordinal suffix for any day of the month.

When you use a branch operator, its meaning exists either “locally”, within the parentheses it was embedded in (as with my ordinal suffix example) or it can be used to place a branch in the whole pattern. Consider the following pattern for example;


preg_match('#some \[b\]bold\[/b\] text|some \[i\]italic\[/i\] text#',$text, $m);

Note: I’ve used # as the expression delimiter as the pattern itself contains forward slashes. I also had to escape [ and ] characters otherwise they’d be considered a character class (see here). This might be a pattern involved in matching BBCode. It could match either;

some [b]bold[/b] text

some [i]italic[/i] text

…thanks to the branch operator in the middle of the pattern.
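Here is a short sketch that runs that pattern against a few inputs. The third input (with a [u] tag) is invented for this example, to show a string that matches neither branch.

```php
<?php
// A branch across the whole pattern: either alternative may match.
$pattern = '#some \[b\]bold\[/b\] text|some \[i\]italic\[/i\] text#';

$bold      = preg_match($pattern, 'some [b]bold[/b] text');
$italic    = preg_match($pattern, 'some [i]italic[/i] text');
$underline = preg_match($pattern, 'some [u]underlined[/u] text');

printf("bold=%d italic=%d underline=%d\n", $bold, $italic, $underline);
// prints "bold=1 italic=1 underline=0"
```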

Hex Literals

Next up in the pattern is this: \x20 – that’s representing a character by its hex code. If you jump over to your ASCII table, you’ll see that the character with hex code 20 is none other than the space character. Huh? Why use a hex code when I can just use a real character? If you remember up there, I’m using the /x pattern modifier, which instructs the regex engine to ignore white space, so we can have a nicely formatted regex. But I want the spaces in “28th Oct 2006” to be part of the pattern, so I need the hex representation to tell the PCRE engine about the space character it should match.

You’ll see more characters specified by their hex code another time. You can also use three digit octal codes for characters, but make sure you read the manual carefully on special cases which apply to them. And watch out for double quoted strings – PHP also has an opinion on what \x20 means…

So the pattern is finally starting to make sense… This part (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) is another capturing sub pattern, containing the branch operator again, allowing me to specify alternative three letter strings for the month, while the end of the pattern captures the four digit year, as you’ve seen before. So finally the output from this PHP script, given the input “28th Oct 2006” looks like;
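The double quote trap can be sketched like this. In a single quoted string PHP passes \x20 through to PCRE untouched, while in a double quoted string PHP itself turns it into a real space, which /x then throws away. The cut-down pattern and variable names are made up for illustration.

```php
<?php
$subject = '28th Oct';

// Single quotes: PCRE receives the four characters \x20 and matches a space.
$single = preg_match('/^ \d{2}th \x20 Oct $/x', $subject);

// Double quotes: PHP converts \x20 to a literal space before PCRE sees it,
// and the /x modifier ignores literal white space in the pattern.
$double = preg_match("/^ \\d{2}th \x20 Oct $/x", $subject);

printf("single=%d double=%d\n", $single, $double);
// prints "single=1 double=0"
```

Sticking to single quoted strings for patterns avoids the surprise.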

Array
(
    [0] => 28th Oct 2006
    [1] => 28
    [2] => Oct
    [3] => 2006
)

Again the zero element of the array is the complete match, while the following three give me the day, month and year respectively – I now only need to swap “Oct” with a number and I can start calculating.
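Swapping “Oct” for a number could be done with a simple lookup, for example. This is just one possible sketch: the $names array and the $parts layout are assumptions made for illustration, and the month list is reused to build the pattern so the two cannot drift apart.

```php
<?php
// Month abbreviations in calendar order; position + 1 is the month number.
$names = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'];

// Same pattern as above, written on one line (so no /x and no need for it).
$pattern = '/^(\d{1,2})(?:st|nd|rd|th)\x20('
         . implode('|', $names)
         . ')\x20(\d{4})$/';

$parts = null;
if (preg_match($pattern, '28th Oct 2006', $m)) {
    $parts = [
        'year'  => (int) $m[3],
        'month' => array_search($m[2], $names) + 1, // "Oct" is index 9
        'day'   => (int) $m[1],
    ];
}

print_r($parts); // year => 2006, month => 10, day => 28
```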

Supporting Multiple Date Formats

Combining sub patterns with branches can result in powerful expressions. So how about extending the previous example to accept another date format, for example 2006-10-28 (yyyy-mm-dd)?

Matching that format on its own would require an expression like;

/^(\d{4})-(\d{1,2})-(\d{1,2})$/

…nothing new there.

But how do we combine it with the previous pattern? In fact it’s nothing too hard – we just need to nest each pattern inside another sub pattern then place a branch between the two (pay attention to the comments below);


<?php
function match_date ($date) {
    if ( preg_match(
                    '/^
     ( # First date format...
         
         (\d{1,2})
         (?:st|nd|rd|th)
         \x20
         (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)
         \x20
         (\d{4})
         
     )
     
     | # branch
     
     ( # Second date format...
         
         (\d{4})-(\d{1,2})-(\d{1,2})
         
     )
    $/x',
                    $date,
                    $matches
                    )
         )
        {
            return $matches;
        }
    return FALSE;
}

print_r(match_date('28th Oct 2006'));
print_r(match_date('2006-10-28'));

All that’s really happened here is I’ve taken the two date format patterns and embedded them in another pattern which has the form /^( )|( )$/.
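One thing worth knowing about that form: the branch operator has the lowest precedence in the pattern, so in /^( )|( )$/ the ^ belongs only to the first alternative and the $ only to the second. The sketch below compares it with a variant wrapped in a non-capturing sub pattern, /^(?:( )|( ))$/. That wrapper is a suggestion added here rather than something from the script above; it does not change the numbering of the captured sub patterns. The variable names and the test inputs are made up.

```php
<?php
// The two date formats, shared by both patterns below.
$formats = '
    ( (\d{1,2}) (?:st|nd|rd|th) \x20
      (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \x20 (\d{4}) )
    |
    ( (\d{4})-(\d{1,2})-(\d{1,2}) )
';

$loose  = '/^' . $formats . '$/x';      // ^ and $ each guard one branch only
$strict = '/^(?:' . $formats . ')$/x';  // ^ and $ apply to both branches

$inputs = ['2006-10-28', '28th Oct 2006 and more', 'see 2006-10-28'];
$results = [];
foreach ($inputs as $input) {
    $results[] = [preg_match($loose, $input), preg_match($strict, $input)];
}

print_r($results);
// '2006-10-28'             loose=1 strict=1
// '28th Oct 2006 and more' loose=1 strict=0
// 'see 2006-10-28'         loose=1 strict=0
```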

There’s one small issue though – as you remember the first pattern, when given a date like ’28th Oct 2006′ returns the month in the form “Oct” while my second pattern, given ‘2006-10-28’ as input returns the month as “10”. I need to be able to determine exactly which date format matched, so I can take the correct steps, when needed, to convert the month to an integer.

In fact that’s easily done, knowing that the index preg_match() assigns to each sub pattern is fixed. You can see this by examining the output: print_r(match_date('28th Oct 2006')); produces;

Array
(
    [0] => 28th Oct 2006
    [1] => 28th Oct 2006
    [2] => 28
    [3] => Oct
    [4] => 2006
)

Element [0] is the complete match across the entire pattern as usual. Meanwhile element [1] is what was matched by the first major sub pattern, containing the first date format pattern. Elements [2]–[4] are the components of the date.

Now compare that with the output I get when feeding the pattern the alternative date format, print_r(match_date('2006-10-28'));

Array
(
    [0] => 2006-10-28
    [1] => 
    [2] => 
    [3] => 
    [4] => 
    [5] => 2006-10-28
    [6] => 2006
    [7] => 10
    [8] => 28
)

Now elements [1] to [4], corresponding to the first date format, are just empty values. The matches for the second date format begin at element [5], which is the match for the second major sub pattern, followed by elements [6] to [8] which are the components of the date again.

So by examining the position of the matches in the returned array, I can determine which sub pattern was matched and handle the results appropriately. In this case I could take the zero element of the array and search the rest of the array to find the index of the sub pattern that matched – something like (with the returned array in $m);


printf(
       "Matched subpattern %d\n",
       array_search($m[0], array_slice($m, 1)) / 4  # 0 = first format, 1 = second
       );

I’ll return to this idea another time, when we get into (simple) parsing with PCRE.
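In the meantime, here is one way the fixed sub pattern indexes could be used to tell the two formats apart. It is only a sketch: the function name which_format(), the format labels and the returned array layout are all invented for this example, and the pattern is wrapped in (?: ) so that ^ and $ apply to both formats (an addition that leaves the indexes unchanged).

```php
<?php
// Hypothetical helper: reports which date format matched, plus its parts.
function which_format(string $date)
{
    $pattern = '/^(?:
        ( (\d{1,2}) (?:st|nd|rd|th) \x20
          (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \x20 (\d{4}) )
        |
        ( (\d{4})-(\d{1,2})-(\d{1,2}) )
    )$/x';

    if (!preg_match($pattern, $date, $m)) {
        return false;
    }

    // [1] is only non-empty when the first format matched.
    if ($m[1] !== '') {
        return ['format' => 'dd Mon yyyy', 'parts' => array_slice($m, 2, 3)];
    }

    // Otherwise the second format matched; its parts sit at [6] to [8].
    return ['format' => 'yyyy-mm-dd', 'parts' => array_slice($m, 6, 3)];
}

print_r(which_format('28th Oct 2006')); // parts: 28, Oct, 2006
print_r(which_format('2006-10-28'));    // parts: 2006, 10, 28
```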

Exploding with Patterns

Now you’ve got used to preg_match(), it’s time to introduce another PCRE function – preg_split(). Conceptually it does the same thing as the explode() function but rather than just a simple string delimiter to break up the string, you can use a regular expression to match the delimiter.

For example, how about being able to split some text containing HTML <br> tags into lines? It may be that in some instances you’re dealing with <br> and in others <br/> (note forward slash);


<?php
$comment = "This is a comment<br>with mixed breaks<br/>in it";
print_r(preg_split('#<br/?>#',$comment));

I’ve used an # as alternative pattern delimiter, because the pattern itself contains forward slashes: #<br/?>#.

Meanwhile the ? quantifier (which you’ve seen before here) after the forward slash in the pattern means zero or one, allowing me to match both <br> and <br/>. The output looks like this;

Array
(
    [0] => This is a comment
    [1] => with mixed breaks
    [2] => in it
)

How about applying the same approach to split up a document by the paragraphs it contains? If the input looks like this;

<p>
    Paragraph one.
</p>

<p>
    Paragraph two.
</p>

<p>
    Paragraph three.
</p>

As a first attempt, let’s try the following on it (the input is in the $doc variable);


print_r(preg_split('#</?p>#', $doc));

The pattern is very similar to that used for <br> tags except the forward slash has moved position, which allows me to match both an opening and a closing paragraph tag. Here’s the output;

Array
(
    [0] => 
    [1] => 
    Paragraph one.

    [2] => 


    [3] => 
    Paragraph two.

    [4] => 


    [5] => 
    Paragraph three.

    [6] => 
)

Hmmm – there’s lots of white space happening there, so I’ll update the pattern so that white space either side of an opening or closing paragraph tag becomes part of the split delimiter;


print_r(preg_split('#\s*</?p>\s*#', $doc));

The White Space Meta Character

Remember * is the zero or more quantifier, which you’ve already seen here.

So what does \s do? It’s another character-class-meta-character, which matches any white space character, such as a space or a new line.

Here’s what the output now looks like;

Array
(
    [0] => 
    [1] => Paragraph one.
    [2] => 
    [3] => Paragraph two.
    [4] => 
    [5] => Paragraph three.
    [6] => 
)

Getting better but what are the empty array elements doing there? They’re a result of a closing tag back-to-back with the next opening tag e.g. </p><p>. There’s nothing in between but as there are two separate split delimiters involved, preg_split() is creating an empty value for the “void” between them. The same goes for the nothing before the first opening tag and after the last closing tag.

It would be nice if we could get rid of them, which we can using the PREG_SPLIT_NO_EMPTY constant flag, which is passed to preg_split() as the fourth argument (the third argument specifies a maximum number of splits or pieces we want returned, -1 meaning “no limit”). As per the manual, using the PREG_SPLIT_NO_EMPTY flag means;

If this flag is set, only non-empty pieces will be returned by preg_split().

So my splitter becomes;


print_r(preg_split('#\s*</?p>\s*#', $doc, -1, PREG_SPLIT_NO_EMPTY));

…producing the following output…

Array
(
    [0] => Paragraph one.
    [1] => Paragraph two.
    [2] => Paragraph three.
)

…much better.
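For anyone wanting to run this end to end, here is a self-contained sketch. The nowdoc used to fill $doc is an assumption about how the input was defined (the snippets above only say it lives in $doc), and the $paragraphs variable is made up.

```php
<?php
// Sample input, reproduced from the three paragraphs shown earlier.
$doc = <<<'HTML'
<p>
    Paragraph one.
</p>

<p>
    Paragraph two.
</p>

<p>
    Paragraph three.
</p>
HTML;

// White space around each tag is swallowed by the delimiter;
// PREG_SPLIT_NO_EMPTY drops the empty pieces between adjacent tags.
$paragraphs = preg_split('#\s*</?p>\s*#', $doc, -1, PREG_SPLIT_NO_EMPTY);

print_r($paragraphs);
```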

Capturing Split Delimiters

As I mentioned at the end of the first example, there is another approach to extracting the components from a date / time stamp like ‘20061028134534’, which involves (arguably) misappropriating preg_split(), by taking advantage of the PREG_SPLIT_DELIM_CAPTURE flag. Consulting the manual…

PREG_SPLIT_DELIM_CAPTURE: If this flag is set, parenthesized expression in the delimiter pattern will be captured and returned as well.

Note: not all regex engines support returning the delimiters, as I’ve moaned about before – the common denominator seems to be enterprisey engines, which seem to frown on making developers’ lives that easy. Anyway… in Perl, PHP, Ruby, Python and (Mozilla!) Javascript, you should find returning a regex split delimiter is supported (and in ICU in fact), making simple tokenizers easy to construct.

So here’s an example that does more or less the same thing as the earlier preg_match() based date / time stamp extractor;


$date = '20061028134534';
print_r(
        preg_split(
                   '/^(\d{4})|(\d{2})/'
                   ,$date,
                   -1,
                   PREG_SPLIT_DELIM_CAPTURE |
                   PREG_SPLIT_NO_EMPTY
                   )
        );

And the output…

Array
(
    [0] => 2006
    [1] => 10
    [2] => 28
    [3] => 13
    [4] => 45
    [5] => 34
)

It’s using the components of the date as the delimiters to split the date up by, then returning those delimiters. Because the input date (in this case) happens to be the right format, I get what I’m looking for, but it’s worth noting this approach fails to validate the date format – you can throw pretty much anything at it and you’ll get some kind of result. But what it brings me is the ability to handle variable length timestamps such as ‘20061028’ and ‘200610281234567890’.

I’ll leave it to you to figure out the ins and outs of what it’s doing – try removing the PREG_SPLIT_DELIM_CAPTURE and PREG_SPLIT_NO_EMPTY flags and see what you get. I’ll be returning to PREG_SPLIT_DELIM_CAPTURE and parsing / tokenizing another time.
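Both points, the variable length handling and the lack of validation, can be seen in a small sketch. The wrapper function split_stamp() and the sample inputs are made up for illustration.

```php
<?php
// Hypothetical wrapper around the preg_split() call shown above.
function split_stamp(string $stamp): array
{
    return preg_split(
        '/^(\d{4})|(\d{2})/',
        $stamp,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );
}

print_r(split_stamp('20061028'));   // 2006, 10, 28: a shorter stamp works
print_r(split_stamp('2006102'));    // 2006, 10, 2: a stray digit is kept as a piece
print_r(split_stamp('not a date')); // the whole string comes back as one piece
```

Counting the returned elements, or checking each one against \d, would be one way to add back some of the validation the preg_match() version gives you.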

Wrap Up

That’s more than enough regex for one shot. Key points in this round were sub patterns, the branch operator and preg_split(). More another time (whenever that is).