Need help converting a user-friendly search string to a regex

Hi,

Basically, I need some way of converting a user-friendly search string into a regex that I can apply to a number of different strings.

The search string would contain search terms and the following operators:

  • And (+)
  • Not (-)
  • Wildcard (*)
  • Exact match ("word")

A search string could look something like this:

    "box of green monkeys" -dancing +singing funny hats

My regex skills are quite limited, so I don’t really know where to start with this. Any help would be really appreciated.

Best regards, George

A regular expression is not the right tool here. I’d look at creating a proper parser or, better yet, find a library which has already addressed this.

Thanks Anthony, but that library is overkill for what I’m doing.

I need this functionality in a newsletter system where an admin can “filter” on custom field values entered by subscribers. I think the simplest way of achieving it is to convert the admin-entered filter into a regular expression that can be applied to each subscriber’s custom fields to determine whether the newsletter should be sent.

Best regards, George

Sorry, I wasn’t clear (as usual).

I’m not suggesting you use that library, just the approach. Take the search string, tokenise it, then use the tokens to build your SQL query.
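
To make the tokenising step concrete, here is a minimal sketch in PHP. The function name `tokenizeSearch` and the token layout (an `operator` and a `term` per token) are made up for illustration and do not come from any particular library.

```php
<?php
// Hypothetical helper: split a search string into operator/term pairs.
function tokenizeSearch(string $search): array
{
    // Optional +/- prefix, then either a "quoted phrase" or a run of non-space characters.
    preg_match_all('/([+-]?)(?:"([^"]*)"|(\S+))/', $search, $matches, PREG_SET_ORDER);

    $tokens = array();
    foreach ($matches as $m) {
        // Group 3 holds an unquoted word; when it is absent or empty the token was a quoted phrase.
        $word = $m[3] ?? '';
        $tokens[] = array(
            'operator' => $m[1], // '+', '-' or '' when no operator was given
            'term'     => ($word !== '') ? $word : ($m[2] ?? ''),
        );
    }

    return $tokens;
}

$tokens = tokenizeSearch('"box of green monkeys" -dancing +singing funny hats');
```

With the example search string from the first post this yields five tokens: the quoted phrase as a single term, `dancing` with the `-` operator, `singing` with the `+` operator, and `funny` and `hats` with no operator.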

Sorry, I should have been clearer too. The data is saved in XML files (client’s limitation). I’m not using a database at all.
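
For anyone with a similar setup, applying a finished pattern to XML data could look roughly like the sketch below. The XML layout (`subscriber` elements with an `email` attribute and `field` children) and the function name `matchingSubscribers` are assumptions for illustration only, not the real file format. It relies on PHP's SimpleXML extension.

```php
<?php
// Hypothetical helper: return the e-mail addresses of subscribers whose
// named custom field matches the given regular expression.
function matchingSubscribers(string $xml, string $fieldName, string $pattern): array
{
    $doc = simplexml_load_string($xml);
    if ($doc === false) {
        return array(); // the XML could not be parsed
    }

    $emails = array();
    foreach ($doc->subscriber as $subscriber) {
        foreach ($subscriber->field as $field) {
            if ((string) $field['name'] === $fieldName
                && preg_match($pattern, (string) $field) === 1) {
                $emails[] = (string) $subscriber['email'];
                break; // one matching field is enough for this subscriber
            }
        }
    }

    return $emails;
}

// Assumed sample data, for illustration only.
$xml = <<<'XML'
<subscribers>
  <subscriber email="a@example.com"><field name="interests">Singing monkeys in funny hats</field></subscriber>
  <subscriber email="b@example.com"><field name="interests">A box of green monkeys</field></subscriber>
</subscribers>
XML;

$emails = matchingSubscribers($xml, 'interests', '/\bsinging\b/i');
```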

I had to do a lot of regex research but I worked this out. If anybody is interested in doing something similar, here it is.



    public function ruleToRegex($rule)
    {
        // Strip the leading operator and the surrounding quotes, then escape
        // the term for use inside a "/"-delimited pattern.
        $clean_part = function ($part) {
            $part = trim(ltrim($part, '+-'), '"');
            return preg_quote($part, '/');
        };

        // Add word boundaries except where the term starts or ends with "*".
        $wildcard_check = function ($part) {
            // preg_quote() escapes the wildcard character, so undo that first
            $part = str_replace('\\*', '*', $part);

            $prefix = ('*' === substr($part, 0, 1)) ? '' : '\\b';
            $suffix = ('*' === substr($part, -1)) ? '' : '\\b';

            // wildcards inside a term are not supported and are dropped
            return $prefix . str_replace('*', '', $part) . $suffix;
        };

        // Tokens are quoted phrases (optionally prefixed with + or -) or
        // runs of non-space characters.
        preg_match_all('/[+-]?"[^"]*"|\\S+/', $rule, $matches);

        $lookaheads = '';
        $ors = array();

        foreach ($matches[0] as $index => $part) {
            $operator = substr($part, 0, 1);

            // make first always AND unless it's NOT
            if (0 === $index && '-' !== $operator) {
                $operator = '+';
            }

            $term = $clean_part($part);
            if ('' === $term) {
                continue; // a bare operator or an empty pair of quotes
            }
            $term = $wildcard_check($term);

            switch ($operator) {
                case '+': //AND
                    $lookaheads .= '(?=.*' . $term . ')';
                    break;
                case '-': //NOT
                    $lookaheads .= '(?!.*' . $term . ')';
                    break;
                default: //OR
                    $ors[] = $term;
                    break;
            }
        }

        $any = empty($ors) ? '' : '(?=.*(?:' . implode('|', $ors) . '))';

        // The "s" modifier lets "." match newlines in multi-line values.
        if ('' !== $lookaheads && '' !== $any) {
            // Either every AND/NOT condition holds or any OR term is present.
            // The lookaheads sit directly after ^ so NOT scans the whole string.
            return '/^(?:' . $lookaheads . '|' . $any . ').*\\z/is';
        }

        // Only one kind of condition, or none at all (which matches everything).
        return '/^' . $lookaheads . $any . '.*\\z/is';
    }
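
To show how the lookaheads behave, here is a small standalone sketch. The pattern is written by hand in the same style as the AND/NOT patterns the function builds (it is not output captured from the function), and the sample values are invented for illustration.

```php
<?php
// Hand-written pattern: the value must contain the word "singing" (AND)
// and must not contain the word "dancing" (NOT), ignoring case.
$pattern = '/^(?=.*\bsinging\b)(?!.*\bdancing\b).*\z/i';

// Invented sample values.
$values = array(
    'Singing monkeys in funny hats',
    'Singing and dancing monkeys',
    'A box of green monkeys',
);

$results = array();
foreach ($values as $value) {
    // preg_match() returns 1 on a match, 0 on no match.
    $results[$value] = (preg_match($pattern, $value) === 1);
}
```

Only the first value matches: the second contains the excluded word and the third lacks the required one. Both lookaheads start at the beginning of the string, so each one scans the whole value regardless of word order.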


Best regards, George