RegEx question

Hello, all,

I am working with a Solr collection (keyword searchable) of PDF files that allows wildcards. The allowed wildcards are * and ? (* matches any run of zero or more characters; ? matches any single character), but a wildcard cannot be the first character of any word (i.e., legal use would be “te*t” or “te?t”, but illegal use would be “*est” or “?est”).

I’m trying to come up with a regex that will remove the wildcard IF it is the first character of any word.

What I currently have that is not working is:

form.keyword = REreplace(form.keyword,'(\b([\?|\*])\w+\b)','\1','all');  // \1 is the ColdFusion equivalent of $1

I’ve never worked with \b before, so I’m not sure where this is going.

V/r,

:slight_smile:

The first problem is that your replacement value, \1, is the entire match, including the wildcard you wanted to remove.

(\b([\?|\*])\w+\b)
^  ^
\1 \2

There currently aren’t any capturing parentheses around just the part you actually want to keep: the word characters matched by \w+.

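Also, a word boundary \b occurs only between a word character (\w) and a non-word character (\W), in either order, or between a word character and the start or end of the string. The question mark and asterisk are both non-word characters, so at the start of the string or after a space there is no word boundary in front of them. In other words, \b fails exactly where we colloquially would say a word begins.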
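
To see this concretely, here is a minimal sketch in Python (chosen only for illustration; the code in this thread is ColdFusion, and the sample string is made up). Python's `re` module uses the same definition of `\b`, so the pattern skips the leading `?` and instead matches the wildcard in the middle of a word:

```python
import re

# Hypothetical sample input: one illegal leading wildcard, one legal inner wildcard.
text = "?est te?t"

# \b needs a word character on exactly one side. At index 0 there is
# start-of-string on the left and "?" on the right, so no boundary exists there.
match = re.search(r"\b[?*]\w+\b", text)

# The only match is the "?t" inside "te?t", where "e" followed by "?" is a boundary.
boundary_hit = (match.start(), match.group())
print(boundary_hit)
```

So a pattern that starts with `\b` followed by the wildcard finds the legal wildcards and misses the illegal ones, which is the opposite of what was intended.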

Depending on the language, you may also need to double up the backslashes. Remember that backslashes can have special meaning in both regexes and in strings. In many languages' strings, for example, \t represents a tab character, \b represents a backspace character, and so on. If you want to represent a literal backslash in such a string, you usually have to write it as '\\'.
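
As an illustration of the two layers of escaping, here is a short Python sketch. Python's string rules are used only as an example; other languages, ColdFusion included, may treat backslashes in string literals differently, so check before doubling:

```python
import re

# String-literal layer: "\t" is one tab character, "\\" is one backslash.
tab = "\t"
one_backslash = "\\"

# Regex layer: the two-character pattern \s means "any whitespace".
single = "\\s"       # the regex engine receives: \s
# The three-character pattern \\s means "a literal backslash, then the letter s".
doubled = "\\\\s"    # the regex engine receives: \\s

matches_space_single = re.search(single, "a b") is not None
matches_space_doubled = re.search(doubled, "a b") is not None
matches_literal_doubled = re.search(doubled, "a\\sb") is not None
```

If a language passes backslashes through to the regex engine untouched, doubling them changes the pattern's meaning in the second way shown here.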

All that being said, here are a couple approaches:

form.keyword = REreplace(form.keyword,'(\\s|^)[?*](\\w)','\1\2','all');

This matches whitespace or the beginning of the string (captured as \1), followed by either ? or * (they don’t have special meaning inside a character class), followed by a single word character (captured as \2). The match is replaced by the whitespace and word character we captured, leaving out only the wildcard you wanted to remove.
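
A minimal sketch of this capture-group approach, in Python rather than ColdFusion, with single backslashes in a raw string so that the regex engine receives `\s` and `\w`. The function name and sample input are made up for illustration:

```python
import re

def strip_leading_wildcards(keywords: str) -> str:
    """Remove a ? or * that starts a word, keeping the text around it."""
    # Group 1: whitespace or start of string. Group 2: the first word character.
    # The wildcard between them is matched but not captured, so it is dropped.
    return re.sub(r"(\s|^)[?*](\w)", r"\1\2", keywords)

cleaned = strip_leading_wildcards("*est te*t ?oo")
print(cleaned)
```

Note that this pattern removes exactly one wildcard per word. An input such as `**est` is left unchanged, because the character after the first wildcard is not a word character. Changing `[?*]` to `[?*]+` would be one way to cover that case.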

Here’s another option:

form.keyword = REreplace(form.keyword,'(?<=\\s|^)[?*](?=\\w)','','all');

In this option, the replacement is simpler. We technically only match the offending characters, so the replacement gets to be blank. But to do so we had to use some more advanced regex features: look-aheads and look-behinds.
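
A minimal sketch of the look-around approach in Python. One substitution is needed: Python's `re` module only accepts fixed-width look-behinds and rejects `(?<=\s|^)`, so this sketch uses the negative look-behind `(?<!\S)` ("not preceded by a non-whitespace character"), which covers the same two cases. Not every regex engine supports look-behind at all, so treat this as an illustration rather than a drop-in replacement. The function name and sample input are made up:

```python
import re

def strip_leading_wildcards_lookaround(keywords: str) -> str:
    """Remove a ? or * that starts a word, matching only the wildcard itself."""
    # (?<!\S)  asserts we are at the start of the string or just after whitespace.
    # (?=\w)   asserts a word character follows, without consuming it.
    # Only the wildcard is consumed, so the replacement is the empty string.
    return re.sub(r"(?<!\S)[?*](?=\w)", "", keywords)

cleaned_lookaround = strip_leading_wildcards_lookaround("*est te*t ?oo")
print(cleaned_lookaround)
```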

Much appreciated, @Jeff_Mott! I’ve been working with RegEx for a while, but rarely need to do things such as this. I’ll give these a shot and get back to you.

V/r,

:slight_smile:

@Jeff_Mott,

I removed the extra backslashes, as I wasn’t trying to escape the backslashes. Once I did that, your first suggestion worked like a charm! So much so, in fact, that I’m also using that mask on the client-side, throwing an alert if the user enters ? or * as the first character of any word. Awesome!

Thank you, very much,

:slight_smile:
