preg_replace regex help needed

djeyewater · November 28, 2011, 10:09pm

I’m trying to write a function to encode a string as per RFC 3987, similar to what rawurlencode does per RFC 3986:


function preg_iriencode($url){
	return preg_replace('/[\\x0-\\xc2\\x9f]|[\\xef\\xbf\\xb0-\\xef\\xbf\\xbd]/eu', 'rawurlencode("$0")', $url);
}
$url = 'Exclamation!Question?NBSP*Newline
Atsign@Tab	Hyphen-Plus+Tilde~好';
echo preg_iriencode($url);

//Expected result: Exclamation%21Question%3FNBSP Newline%0AAtsign%40Tab%09Hyphen-Plus%2BTilde~好
//Actual result: Exclamation%21Question%3FNBSP%C2%A0Newline%0AAtsign%40Tab%09Hyphen-Plus%2BTilde~好

Non-breaking space (NBSP) is \xC2\xA0 in UTF-8, and the first character class in the regex only goes up to \xc2\x9f, so I don’t understand why it is being matched and encoded.

(vBulletin seems to convert the NBSP in the test string into a *, but the actual code does have an NBSP in it.)

StarLion · November 29, 2011, 2:16pm

Well, I think (not 100% sure, but somewhere around 90%) the problem you’re facing is this:
You write:
[\x0-\xc2\x9f]
and you think it says:
“Anything from character code 0 to character code C29F”.
What it actually says is:
“Anything from character code 0 to character code C2, or character code 9f”.
With the u modifier those are code points rather than bytes, so the range 0 to C2 includes NBSP (U+00A0), and it gets matched.

Try
\x{c29f} instead.
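
One way to check how PCRE reads that class is to test single characters against it. This is a minimal sketch, assuming PHP 7 or later (for the "\u{...}" string escapes); the variable names are only for illustration. With the u modifier, \xc2 and \x9f are read as code points U+00C2 and U+009F, not as UTF-8 bytes:

```php
<?php
// The class from the original post: a range \x0-\xc2, then a single \x9f.
$pattern = '/[\\x0-\\xc2\\x9f]/u';

$nbsp = "\u{00A0}"; // NBSP, UTF-8 bytes C2 A0
$hao  = "\u{597D}"; // the Chinese character, UTF-8 bytes E5 A5 BD

// U+00A0 lies inside U+0000..U+00C2, so NBSP matches and would be encoded.
var_dump(preg_match($pattern, $nbsp)); // int(1)

// U+597D lies outside the class, so the Chinese character is left alone.
var_dump(preg_match($pattern, $hao)); // int(0)
```

That agrees with the "Actual result" in the first post: NBSP is encoded, the Chinese character is not.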

djeyewater · November 29, 2011, 7:07pm

Thanks, I didn’t realise that and haven’t seen the \x{c29f} syntax before.
Unfortunately \x{c29f} didn’t work either.
I changed the function as follows:

function preg_iriencode($url){
	return preg_replace('/[\\x0-\\x{c29f}]/eu', 'rawurlencode("$0")', $url);
}

But it now encodes the entire string, including the Chinese character, which is way outside that range (\xe5\xa5\xbd).

StarLion · November 29, 2011, 8:33pm

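With the u modifier, \x{c29f} is the code point U+C29F rather than the bytes C2 9F. The Chinese character is code point U+597D, which falls between 0 and C29F.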
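
The difference between the UTF-8 bytes and the code point can be seen directly. A small sketch, again assuming PHP 7 or later, with an illustrative variable name:

```php
<?php
$hao = "\u{597D}"; // the Chinese character from the test string

// Its UTF-8 encoding is the three bytes E5 A5 BD ...
echo bin2hex($hao), "\n"; // e5a5bd

// ... but in u mode \x{...} names a code point, and this one is U+597D.
var_dump(preg_match('/^\\x{597D}$/u', $hao)); // int(1)

// 0x597D is below 0xC29F, so the widened class also matches it.
var_dump(preg_match('/[\\x0-\\x{c29f}]/u', $hao)); // int(1)
```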

djeyewater · November 29, 2011, 9:18pm

Cheers, I get you now. I was using the UTF-8 hex bytes rather than the Unicode code points, so the character class should be [\x{0000}-\x{009F}] to match what I wanted.
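
Putting that together, the function might look something like the sketch below. Two assumptions that are not from the thread: it uses preg_replace_callback, because the e modifier was deprecated in PHP 5.5 and removed in PHP 7.0, and it rewrites the second byte range from the first post (\xef\xbf\xb0 to \xef\xbf\xbd) as code points U+FFF0 to U+FFFD. It is an illustration of the code point ranges, not a complete RFC 3987 encoder.

```php
<?php
// Sketch only: passes U+0000..U+009F and U+FFF0..U+FFFD through rawurlencode().
function preg_iriencode(string $url): string
{
    $encoded = preg_replace_callback(
        '/[\\x{0}-\\x{9F}\\x{FFF0}-\\x{FFFD}]/u',
        function (array $m): string {
            // rawurlencode() leaves letters, digits, "-", "_", "." and "~" as they are.
            return rawurlencode($m[0]);
        },
        $url
    );

    // preg_replace_callback() returns null on error (for example invalid UTF-8);
    // this sketch simply returns the input unchanged in that case.
    return $encoded ?? $url;
}

// Same test string as the first post, written with escapes so the
// NBSP, newline, tab and Chinese character survive copy and paste.
$url = "Exclamation!Question?NBSP\u{00A0}Newline\nAtsign@Tab\tHyphen-Plus+Tilde~\u{597D}";
echo preg_iriencode($url), "\n";
```

With this class the NBSP (U+00A0) and the Chinese character (U+597D) fall outside both ranges and are left unencoded, which is the expected result from the first post.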