Regex delimiters

force · July 17, 2013, 12:44am

For a long time, whenever I’ve needed a regular expression, I’ve standardized on the copyright symbol (©) as the delimiter. It isn’t on the keyboard, so I could be sure it would never appear in a regular expression, unlike ! @ # \ or / (which are sometimes all in use within a single regex).

$result=preg_match('©<.*?>©', '<something string>');

However, today I needed a regular expression with accented characters which included this:

[a-zA-ZàáâäãåąćęèéêëìíîïłńòóôöõøùúûüÿýżźñçčšžÀÁÂÄÃÅĄĆĘÈÉÊËÌÍÎÏŁŃÒÓÔÖÕØÙÚÛÜŸÝŻŹÑßÇŒÆČŠŽ∂ð \\,\\.\\'-]+

After including this new regex in the PHP file in my IDE (Eclipse PDT), I was prompted to save the PHP file as UTF-8 instead of the default cp1252.

After saving and running the PHP file, every time I used a regex in a preg_match() or preg_replace() function call, it generated a generic PHP warning (Warning: preg_match in file.php on line x) and the regex was not processed.

So, two questions:

Is there another symbol that isn’t typically found on a keyboard that I could standardize on as a delimiter, so I don’t have to check each and every regex to see whether that symbol is used somewhere in the expression?
Or, is there a way I can keep using the copyright symbol as the standard delimiter when the file is encoded as UTF-8?

Jeff_Mott · July 17, 2013, 4:40am

I suppose you could try one of the non-printable characters, such as the “start of heading” and “start of text” control characters (0x01 and 0x02). Notepad++ is able to display them.

QMonkey · July 17, 2013, 7:20pm

For years now I have used the tilde (~) since it is rarely used in a regex (possibly never in the ones I have worked on). That has worked very well for me and there should be no issues with character encoding like you’ve stumbled upon here.
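A likely explanation for the encoding trouble, sketched here under the assumption that PHP’s preg functions take the first byte of the pattern string as the delimiter: in cp1252 the copyright symbol is the single byte 0xA9, but in UTF-8 it is the two bytes 0xC2 0xA9, so only the first byte serves as the delimiter and the leftover byte lands where the pattern modifiers go. The byte escapes stand in for a UTF-8 encoded source file, and the variable names are only for illustration.

```php
<?php
// "\xC2\xA9" is the UTF-8 encoding of the copyright symbol, spelled as byte
// escapes so the example behaves the same however this file is saved.
$copyright = "\xC2\xA9";
$subject   = '<something string>';

// Two bytes, not one: only the first (0xC2) can act as the delimiter.
var_dump(strlen($copyright));

// The trailing 0xA9 byte is read as a pattern modifier, so the call fails.
// The @ only hides the warning for this demonstration.
$viaCopyright = @preg_match($copyright . '<.*?>' . $copyright, $subject);
var_dump($viaCopyright);

// The tilde is a single byte in cp1252 and UTF-8 alike.
$viaTilde = preg_match('~<.*?>~', $subject);
var_dump($viaTilde);
```

Any single-byte ASCII punctuation character would sidestep the problem in the same way; the tilde is just the one suggested here.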

If you’re using unknown values in the regex, you should use preg_quote(), passing your delimiter as its second argument so that it gets escaped too. Or just use it all the time if you really don’t want to think about it (not recommended).
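As a minimal sketch of that advice, assuming the unknown value is meant to be matched literally: preg_quote() accepts the delimiter as an optional second argument and escapes it along with the regex metacharacters. The sample strings and variable names are made up for the example.

```php
<?php
// Hypothetical user-supplied text that happens to contain the delimiter
// and some regex metacharacters.
$needle  = 'a~b (v1.0)';
$subject = 'release a~b (v1.0) is out';

// The second argument tells preg_quote() which delimiter to escape as well.
$pattern = '~' . preg_quote($needle, '~') . '~';

$found = preg_match($pattern, $subject);
var_dump($pattern, $found);
```

Without the second argument the tilde inside $needle would be left unescaped and would end the pattern early.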

Trying to find a non-printable or some other odd character seems to be inviting trouble some day.

Jeff_Mott · July 17, 2013, 8:26pm

Absolutely agree. I don’t actually use them myself, but if even the copyright symbol isn’t sufficient, then it seems like we need to be drastic and/or clever.

I used to use the ~ myself, then later switched to # for no other reason than it seemed to be the more conventional choice.

lorenw · July 17, 2013, 8:55pm

+1 for ~.
For me it seems easier to read, and besides, who uses it anyway?

force · July 18, 2013, 12:59am

Not a bad idea, but unfortunately, it’s in use.

Basically, every symbol on the keyboard is used at one point or another, which is the whole problem.

QMonkey · July 18, 2013, 3:32pm

But it isn’t a problem at all. If you have the delimiter character in your regex, you escape it with a backslash. If you’re working with an unknown regex (in a variable), you escape it with preg_quote(). Using some character not on the keyboard is the problem, that’s why you started this thread.
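For a hand-written pattern, the backslash escape might look like the sketch below. The pattern and the sample input are hypothetical; the point is only that a literal tilde inside a tilde-delimited pattern is written as \~.

```php
<?php
// Hand-written pattern for a path like "~alice/notes": the literal tilde
// is escaped with a backslash because ~ is also the delimiter.
$pattern = '~^\~(\w+)/~';
$subject = '~alice/notes';  // hypothetical sample input

$matched = preg_match($pattern, $subject, $m);
var_dump($matched, $m[1] ?? null);
```

The single-quoted PHP string leaves \~ and \w untouched, so the regex engine receives the backslashes as written.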

I can understand the desire to have something you will never need to escape, but “never” can always happen. Any character could show up in a regex, even the copyright symbol.
