REGEX Help Required

Hi guys,

I have inherited some code that does a rather slow and messy check of what characters are allowed and what aren’t. I have this string:


Any variable can then be checked against it. The code loops through every character in the variable and, for each one, loops through every character in the string above looking for a match; if there is none, the character is removed. However, as I’m sure you can imagine, this is SLOW, and when it’s parsing more than 3000 customer records to create a CSV export, it times out. The customer has asked that I increase the time limit, but personally I just think the whole thing needs a rewrite.
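For anyone trying to picture it, the loop described above probably looks something like the sketch below. This is a reconstruction, not the inherited code: the function name filterAllowedSlow() and the allow-list are made up for illustration (the list is pieced together from the pattern further down), and the type declarations assume PHP 7 or later.

```php
<?php
// Hypothetical reconstruction of the slow check; names and allow-list are assumptions.
function filterAllowedSlow(string $value, string $allowed): string
{
    $out = '';
    for ($i = 0, $n = strlen($value); $i < $n; $i++) {
        // Inner loop: scan the whole allow-list for every single input character.
        for ($j = 0, $m = strlen($allowed); $j < $m; $j++) {
            if ($value[$i] === $allowed[$j]) {
                $out .= $value[$i]; // keep the character and stop scanning
                break;
            }
        }
    }
    return $out;
}

// Assumed allow-list: a few symbols plus ASCII letters and digits.
$allowed = " '_%{}()[]/+-@."
    . implode('', array_merge(range('a', 'z'), range('A', 'Z'), range('0', '9')));

echo filterAllowedSlow('Hello, this is a test &%^', $allowed), "\n"; // Hello this is a test %
```

The work grows with (length of the value) x (length of the allow-list) for every field of every record, and all of it runs in interpreted PHP, which is presumably where the time goes.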

So, I wanted to use preg_replace() to remove any characters that aren’t in that list. I’m not usually too bad at REGEX, but I can’t get this one to work quite right. This is the REGEX I’ve got so far:

/([^ '_%\\\\{\\\\}\\\\(\\\\)\\\\[\\\\]\\\\/\\\\+-@\\\\.a-zA-Z0-9]*)/

It’s not the tidiest in the world, but it works, with the exception that it leaves commas in:

  $txt = 'Hello, this is a test &%^£';

  $pattern = "/([^ '_%\\\\{\\\\}\\\\(\\\\)\\\\[\\\\]\\\\/\\\\+-@\\\\.a-zA-Z0-9]*)/";
  echo preg_replace($pattern, '', $txt);

Any advice please on what to do about the commas?

Well, wrapping the entire thing in () is pointless (it just creates a subpattern of the entire pattern…)
What exactly are you trying to do? There are way too many slashes in there for it to make any sense.


\w = [_a-zA-Z0-9], so let’s throw those away…

Well, the only metacharacters in there (while inside a class) are - and ], and you want to negate the class, so…

Should be a valid pattern.

EDIT: If \ was meant to be part of the class, add a \\ in there too

Theoretically it’s really simple; only allow the characters in the original string to be in each individual field. I don’t know why these characters are the only ones allowed, since the guy who wrote it has long since left, I’m just trying to speed it up. The ONLY problem that I have is that commas are being allowed to pass through, even though they’re not in the list of allowed characters (unless there’s something in there that is comparable to a comma that is letting it in?). Commas absolutely can’t be allowed since this is for a CSV export.

Dunno how the parentheses got in there, sorry. Gone now.
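One way to express "only allow the characters in the original string" without hand-escaping anything is to let preg_quote() build the class. A minimal sketch follows, assuming an allow-list of symbols like the one implied by the pattern in this thread; the $allowed value and the filterAllowed() helper are hypothetical, not the original code.

```php
<?php
// Hypothetical allow-list of symbols, reconstructed from the pattern in this thread.
$allowed = " '_%{}()[]/+-@.";

// preg_quote() escapes regex metacharacters, including "-" and the "/" delimiter
// given as its second argument, so no two characters can accidentally form a range.
$pattern = '/[^' . preg_quote($allowed, '/') . 'a-zA-Z0-9]/';

function filterAllowed(string $value, string $pattern): string
{
    // Remove every character that is NOT in the allowed class.
    return preg_replace($pattern, '', $value);
}

echo filterAllowed('Hello, this is a test &%^', $pattern), "\n"; // Hello this is a test %
```

The letter and digit ranges are appended after the quoted part on purpose, because preg_quote() would escape the "-" in "a-z" and turn the range into three literal characters.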

I’m thinking that the extra slashes in the pattern you posted were unexpectedly closing the class, meaning it never matched. Try the pattern above and see how it goes. Use single quotes; that way you don’t have to fudge around with PHP’s string interpreter.
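To illustrate the single-quote advice, here is a small sketch; the pattern is made up for the demonstration and has nothing to do with the allow-list. In single quotes only \\ and \' are treated as escapes, while double quotes also rewrite sequences such as \$ and \n before PCRE ever sees the pattern.

```php
<?php
// Single quotes: the backslashes reach PCRE untouched.
$single = '/^\$\d+\n/';

// Double quotes: "\$" collapses to "$" and "\n" becomes a real newline character;
// "\d" survives only because PHP does not recognise it as an escape.
$double = "/^\$\d+\n/";

var_dump(strlen($single)); // int(10)
var_dump(strlen($double)); // int(8)
var_dump($single === $double); // bool(false)
```

Same source text, two different patterns, so with double quotes every backslash has to be checked against both PHP’s rules and PCRE’s.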

I managed to get the original one to work by adding \\ in front of - (dunno how I missed that) as that sorted the issue with the commas.
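For anyone wondering why escaping the hyphen fixes the commas: inside a character class, an unescaped - between two characters defines a range, so +-@ means "everything from + (0x2B) to @ (0x40)", and the comma (0x2C) sits inside that range. A cut-down sketch with a deliberately tiny class, not the full pattern from this thread:

```php
<?php
// Unescaped: "+-@" is a range covering 0x2B..0x40, which includes "," and ";".
$unescaped = '/[^a-z+-@]/';

// Escaped: "+", "-" and "@" are three separate literal characters.
$escaped = '/[^a-z+\-@]/';

echo preg_replace($unescaped, '', 'a,b;c'), "\n"; // a,b;c
echo preg_replace($escaped, '', 'a,b;c'), "\n";   // abc
```

Putting the hyphen first or last in the class should have the same effect as escaping it.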

The output that I get from ‘Hello, this is a test &%^£’ is now ‘Hello this is a test %’, which is what I’d expect. What I get from yours is ‘Hellothisisatest%Â’. I have added the space to the beginning of the character class, which helps, but ‘Â’ should still not be in there: yes, it’s covered by \w, but it’s not in the list of acceptable characters above.
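A possible explanation for the stray ‘Â’, offered as a guess about the setup rather than a diagnosis: in UTF-8, £ is stored as the two bytes 0xC2 0xA3. Without the u modifier PCRE works byte by byte, and \w can follow locale-specific character tables, so under a Latin-1 style locale the 0xC2 byte (‘Â’) may count as a letter while 0xA3 does not. Spelling out a-zA-Z0-9 instead of \w avoids that dependence. A minimal sketch:

```php
<?php
// "\xC2\xA3" is the UTF-8 encoding of the pound sign.
$txt = "Hello, this is a test &%^\xC2\xA3";

// Explicit ASCII ranges instead of \w; the "u" modifier makes PCRE treat the
// subject and pattern as UTF-8, so the pound sign is handled as one character.
$pattern = '/[^ \'_%{}()\[\]\/+\-@.a-zA-Z0-9]/u';

echo preg_replace($pattern, '', $txt), "\n"; // Hello this is a test %
```

Note that with the u modifier preg_replace() returns null if the input is not valid UTF-8, so this only suits data that is known to be UTF-8.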

Oh yes, just seen your last reply:

"/[^ '_%\{\}\(\)\[\]\/\+\-@\.a-zA-Z0-9]*/" works without those extra slashes in. They were in because I’d copied some code from one of these REGEX generators and I’d not cleaned them up.

Have just tested the new code on live data, and suffice to say that by using preg_replace() it’s a lot quicker. Gone are the 30-second script timeouts, and here are sub 1 second runtimes. The REGEX ain’t pretty, but my God does it work :)