I’m searching for ways to sanitize UTF-8 input on PHP 5.3. What I’d like to do is make sure that the input is valid UTF-8, then remove invalid (or simply "discouraged") XML chars.
So far, I’ve found that mb_convert_encoding() could fix invalid UTF-8 by running
My problem now is with the handling of invalid chars. Apparently, preg_match() with the u modifier chokes when the input contains Unicode surrogates. I’ve found a regexp on the W3C website that doesn’t require the u modifier, but it’s quite complicated, and preg_match_all() chokes on relatively small inputs (a few kilobytes), while I need to process inputs closer to a megabyte.
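For anyone who wants to see the "choking" for themselves, here is a minimal sketch of how it shows up. The helper name is_valid_utf8() is made up for this example; the only assumption is that preg_match() with the u modifier returns false (rather than 0 or 1) when the subject is not valid UTF-8, which is also what seems to happen with encoded surrogates.

```php
<?php
// is_valid_utf8() is a hypothetical helper name, not part of PHP.
// With the u modifier, PCRE checks the subject for UTF-8 validity before
// matching, so the empty pattern matches (returns 1) only on valid input.
function is_valid_utf8($bytes)
{
    return preg_match('//u', $bytes) === 1;
}

var_dump(is_valid_utf8("abc"));     // valid ASCII
var_dump(is_valid_utf8("a\xFFb"));  // 0xFF never occurs in UTF-8
```

This only detects the problem; it does not say where the bad bytes are, so it is no help for removing them.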
So now I’m back to square one: is there any (PECL or not) extension that sanitizes UTF-8 that I wouldn’t know about?
Quick recap of what I’ve tried, and why it didn’t work:
[list]
[*]preg_replace() - chokes on Unicode surrogates
[*]anything based on preg_match_all() - segfaults on large input
[*]Normalizer::normalize() - returns NULL on Unicode surrogates
[*]mb_convert_encoding() - leaves Unicode surrogates untouched
[*]iconv() - same as mb_convert_encoding()
[/list]
Well, the answer literally came to me in my sleep. As I was about to take a nap, I realized that I could eliminate surrogates with a simple (non-Unicode-aware) regexp:
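The regexp itself is missing from the post at this point. As a rough sketch of what such a byte-level pattern could look like (an assumption, not necessarily what was originally posted): a surrogate, U+D800 to U+DFFF, encoded as UTF-8 is always the byte 0xED followed by a byte in 0xA0..0xBF and a byte in 0x80..0xBF, so it can be matched without the u modifier. The function name strip_utf8_surrogates() is made up for the example.

```php
<?php
// strip_utf8_surrogates() is a hypothetical helper name.
// No u modifier: the pattern works on raw bytes, so PCRE never
// validates the subject and cannot choke on the surrogates.
function strip_utf8_surrogates($bytes)
{
    return preg_replace('/\xED[\xA0-\xBF][\x80-\xBF]/', '', $bytes);
}

$dirty = "a\xED\xA0\x80b";               // "a", encoded U+D800, "b"
$clean = strip_utf8_surrogates($dirty);  // surrogate bytes removed
```

U+D7FF (bytes ED 9F BF), the last code point before the surrogate range, is left alone because its second byte is below 0xA0.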
Use the e (eval) modifier in your preg_replace() so that when PCRE hits a surrogate (a code point that does not represent a character on its own), the replacement just returns an empty string for that single character. That’s much different from using only the u modifier: with the e modifier the replacement is evaluated as PHP code once for each match, whereas with the u modifier the whole subject is checked as UTF-8 up front, so if it contains a surrogate, preg_replace() returns NULL for the complete string. Yes, using the e modifier is more expensive than just using the u modifier, but the e modifier avoids the problem that the u modifier cannot!
Unicode to UTF-8: handles \u(code) or %u(code).
Fixed the end of the function decode_unicode() so that it returns nothing for an invalid character.
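<?php
// Conditional definitions: both functions are declared at global scope,
// and only if unicode_decode() does not exist yet.
if ( ! function_exists ( 'unicode_decode' ) )
{
    // Replace \uXXXX and %uXXXX escapes with their UTF-8 encoding.
    // Note: the e modifier needs PHP < 7.0 (it is deprecated as of 5.5).
    function unicode_decode ( $string, $charset = '' )
    {
        return preg_replace ( '#(?:\\\\|%)u([0-9a-f]{4})#ie', "decode_unicode('\\1')", $string );
    }
    // Encode a single code point, given as hex digits, as UTF-8.
    function decode_unicode ( $c )
    {
        $c = hexdec ( $c );
        // Invalid: UTF-16 surrogates and anything above U+10FFFF return nothing.
        if ( ( $c >= 0xd800 && $c <= 0xdfff ) || $c > 0x10ffff ) return '';
        if ( $c < 0x80 ) return chr ( $c );
        if ( $c < 0x800 ) return chr ( 0xc0 | ( $c >> 6 ) ) . chr ( 0x80 | ( $c & 0x3f ) );
        if ( $c < 0x10000 ) return chr ( 0xe0 | ( $c >> 12 ) ) . chr ( 0x80 | ( ( $c >> 6 ) & 0x3f ) ) . chr ( 0x80 | ( $c & 0x3f ) );
        return chr ( 0xf0 | ( $c >> 18 ) ) . chr ( 0x80 | ( ( $c >> 12 ) & 0x3f ) ) . chr ( 0x80 | ( ( $c >> 6 ) & 0x3f ) ) . chr ( 0x80 | ( $c & 0x3f ) );
    }
}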
$str = '<div>%u786e%u4fdd%u6d4f%u89c8%u5668%u7684%u5730%u5740%u680f%u4e2d%u663e%u793a%u7684%u7f51%u7ad9%u5730%u5740%u7684%u62fc%u5199%u548c%u683c%u5f0f%u6b63%u786e%u65e0%u8bef%u3002
\\u5982\\u679c\\u901a\\u8fc7\\u5355\\u51fb\\u94fe\\u63a5\\u800c\\u5230\\u8fbe\\u4e86\\u8be5\\u7f51\\u9875\\uff0c\\u8bf7\\u4e0e\\u7f51\\u7ad9\\u7ba1\\u7406\\u5458\\u8054\\u7cfb\\uff0c\\u901a\\u77e5\\u4ed6\\u4eec\\u8be5\\u94fe\\u63a5\\u7684\\u683c\\u5f0f\\u4e0d\\u6b63\\u786e\\u3002
This is the end of the word wrap test. Just a little more text to go over the maximum line length so you can see how it works.
</div>';
echo unicode_decode ( $str );
?>
[ot]Very possible, it’s one of those little secrets of PHP. You should be aware of this, however:
function test () {
    function another_test() { return __FUNCTION__; }
    return __FUNCTION__;
}
if ( function_exists( 'another_test' ) ) print 'another_test exists';
print test();
if ( function_exists( 'another_test' ) ) print 'another_test exists';
You will only get the last “another_test exists”, because another_test() is not declared until test() has actually run.
Also, if I were using PHP 5.3 features, I would have used a closure, which would keep the nested function contained in the other function.
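To illustrate the closure variant, here is a small sketch (PHP 5.3+). The names outer_with_closure() and $helper are invented for the example.

```php
<?php
// outer_with_closure() and $helper are hypothetical names.
function outer_with_closure()
{
    // The helper lives in a local variable, so nothing is declared
    // globally and calling outer_with_closure() repeatedly is safe.
    $helper = function () {
        return 'helper result';
    };
    return $helper();
}

print outer_with_closure();
```

Unlike the nested named function, which triggers a "Cannot redeclare" fatal error the second time the outer function runs, the closure is simply created again on each call.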
[/ot]
You are definitely right, logic_earth: preg_replace_callback() is so much faster…
Here is an update…
This handles every conceivable way that a browser may pass Unicode characters by GET or POST and converts valid characters to UTF-8. It fixes numeric entities that may have octal representations too! Bogus characters are replaced with ‘’ (PHP defaults to ‘?’); if you want to change that, look at the end of decode_unicode() and insert the character(s) you want to return for a bogus character!
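The updated code is not included here. As a rough sketch only, this is how the \uXXXX / %uXXXX part could look with preg_replace_callback() and a PHP 5.3 closure. The names decode_unicode_escapes() and codepoint_to_utf8() are made up, numeric entities are not covered, and rejecting surrogates is an assumption about what counts as "bogus".

```php
<?php
// Hypothetical names; covers only \uXXXX and %uXXXX escapes.
function codepoint_to_utf8($c)
{
    // Bogus code points (surrogates, above U+10FFFF): change the
    // return value here if you prefer '?' or another placeholder.
    if (($c >= 0xD800 && $c <= 0xDFFF) || $c > 0x10FFFF) {
        return '';
    }
    if ($c < 0x80) {
        return chr($c);
    }
    if ($c < 0x800) {
        return chr(0xC0 | ($c >> 6)) . chr(0x80 | ($c & 0x3F));
    }
    if ($c < 0x10000) {
        return chr(0xE0 | ($c >> 12))
             . chr(0x80 | (($c >> 6) & 0x3F))
             . chr(0x80 | ($c & 0x3F));
    }
    return chr(0xF0 | ($c >> 18))
         . chr(0x80 | (($c >> 12) & 0x3F))
         . chr(0x80 | (($c >> 6) & 0x3F))
         . chr(0x80 | ($c & 0x3F));
}

function decode_unicode_escapes($string)
{
    // The callback receives the match array; $m[1] holds the hex digits.
    return preg_replace_callback(
        '#(?:\\\\|%)u([0-9a-f]{4})#i',
        function ($m) {
            return codepoint_to_utf8(hexdec($m[1]));
        },
        $string
    );
}

echo decode_unicode_escapes('%u786e\u4fdd'); // U+786E followed by U+4FDD
```

The callback is called once per match without evaluating any generated PHP code, which is where the speed difference over the e modifier comes from.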