Hi,
Basically, I need some way of converting a user-friendly search string into a regex that I can apply to a number of different strings.
The search string would contain search terms and the following operators:
- And (+)
- Not (-)
- Wildcard (*)
- Exact match ("word")
A search string could look something like
"box of green monkeys" -dancing +singing funny hats
My regex skills are quite limited, so I don’t really know where to start with this. Any help would be really appreciated.
Best regards, George
A regular expression is not the right tool here. I’d look at creating a proper parser or, better yet, find a library which has already addressed this.
Thanks Anthony, but that library is overkill for what I’m doing.
I need this functionality on a newsletter system whereby an admin can “filter” custom field values entered by subscribers. I think the simplest way of achieving it is to somehow convert the admin-entered filter into a regular expression that can be applied to the subscribers’ custom fields to determine whether the newsletter should be sent or not.
Best regards, George
Sorry, I wasn’t clear (as usual).
I’m not suggesting you use that library, just the approach. Take the search string, tokenise it, then use the tokens to build your SQL query.
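As a rough illustration of the tokenise step, something like the following PHP sketch could work. The function name tokeniseSearch(), the 'and' / 'not' / 'plain' labels and the shape of the token arrays are assumptions made for the example, not taken from any library.

```php
<?php
// Hypothetical helper: split a search string into operator/term tokens.
function tokeniseSearch(string $search): array
{
    // Map the prefix character to a label (labels are illustrative only).
    $ops = ['+' => 'and', '-' => 'not', '' => 'plain'];
    $tokens = [];

    // Optional +/- prefix, then either a "quoted phrase" or a run of non-space characters.
    preg_match_all('/([+-]?)(?:"([^"]*)"|(\S+))/', $search, $matches, PREG_SET_ORDER);

    foreach ($matches as $m) {
        // Group 3 is only present when the bare-word alternative matched.
        $isPhrase = !isset($m[3]);
        $term = $isPhrase ? $m[2] : $m[3];
        if ($term === '') {
            continue; // skip empty phrases such as ""
        }
        $tokens[] = ['op' => $ops[$m[1]], 'term' => $term, 'phrase' => $isPhrase];
    }

    return $tokens;
}

$tokens = tokeniseSearch('"box of green monkeys" -dancing +singing funny hats');
foreach ($tokens as $t) {
    printf("%-5s %s%s\n", $t['op'], $t['term'], $t['phrase'] ? ' (exact phrase)' : '');
}
```

Once the string is broken into tokens like this, each token can be handled on its own terms (added to a query, turned into a pattern, or checked with plain string functions) instead of trying to do everything in one expression.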
Sorry, I should have been clearer too. The data is saved in XML files (client’s limitation). I’m not using a database at all.
I had to do a lot of regex research but I worked this out. If anybody is interested in doing something similar, here it is.
public function ruleToRegex($rule)
{
    // Strip leading operators and surrounding quotes, then escape regex metacharacters.
    $clean_part = function ($part) {
        // Only strip at the edges, so a term such as well-known keeps its hyphen.
        $part = trim(ltrim($part, '+-'), '"');
        // Pass the delimiter so that a "/" in a search term is escaped as well.
        return preg_quote($part, '/');
    };

    // Add word boundaries unless the term starts or ends with a wildcard.
    $wildcard_check = function ($part) {
        // Undo the escaping that preg_quote() applied to the wildcard character.
        $part = str_replace('\\*', '*', $part);
        $prefix = ('*' === substr($part, 0, 1)) ? '' : '\\b';
        $suffix = ('*' === substr($part, -1)) ? '' : '\\b';
        return $prefix . str_replace('*', '', $part) . $suffix;
    };

    // Tokens: a quoted phrase (optionally prefixed with + or -) or a run of non-space characters.
    preg_match_all('/[+-]?"[^"]*"|\\S+/', $rule, $matches);

    if (empty($matches[0])) {
        // Empty rule: match everything.
        return '/^(.*)\\z/is';
    }

    // Make the first term always AND unless it's NOT.
    if ('-' !== substr($matches[0][0], 0, 1)) {
        $matches[0][0] = '+' . $matches[0][0];
    }

    $lookaheads = '';
    $ors = array();
    foreach ($matches[0] as $part) {
        switch (substr($part, 0, 1)) {
            case '+': // AND
                $lookaheads .= '(?=.*' . $wildcard_check($clean_part($part)) . ')';
                break;
            case '-': // NOT
                $lookaheads .= '(?!.*' . $wildcard_check($clean_part($part)) . ')';
                break;
            default: // OR
                $ors[] = $wildcard_check($clean_part($part));
                break;
        }
    }

    // The "s" modifier lets "." match newlines, so multi-line values can match too.
    if (!empty($ors)) {
        // Match if all AND/NOT terms hold, or if any OR term is present.
        // The lookaheads stay anchored at the start so a NOT term covers the whole string.
        return '/^(?:' . $lookaheads . '|(?=.*(?:' . implode('|', $ors) . '))).*\\z/is';
    }

    // There are always lookaheads at this point, as the first term is always an AND or a NOT.
    return '/^' . $lookaheads . '.*\\z/is';
}
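For anyone wondering how a pattern like this gets applied, here is a small sketch. The pattern is hand-written in the same shape the function builds for one AND term and one NOT term (a lookahead per AND, a negative lookahead per NOT), and the $fields values are made-up sample data rather than anything from the real newsletter system.

```php
<?php
// Hand-written pattern: must contain the word "singing", must not contain the word "dancing".
$regex = '/^(?=.*\bsinging\b)(?!.*\bdancing\b).*\z/i';

// Made-up custom field values, purely for illustration.
$fields = [
    'I like singing in the shower',
    'Singing and dancing all night',
    'Quiet reading only',
];

// Keep only the values the pattern accepts.
$matched = array_values(array_filter($fields, function ($value) use ($regex) {
    return preg_match($regex, $value) === 1;
}));

print_r($matched);
```

Only the first value passes: the second contains the NOT term and the third is missing the AND term.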
Best regards, George