How to sanitize UTF-8 input efficiently?

Hello,

I’m searching for ways to sanitize UTF-8 input on PHP 5.3. What I’d like to do is make sure that the input is valid UTF-8, then remove invalid (or simply "discouraged") XML chars.

So far, I’ve found that mb_convert_encoding() could fix invalid UTF-8 by running

$str = mb_convert_encoding($str, 'utf-8', 'utf-8');
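
A minimal sketch of that validate-then-repair step, assuming the mbstring extension is loaded. The helper name ensure_utf8() is hypothetical; mb_check_encoding() lets already-valid input skip the conversion, and invalid bytes are replaced with mbstring's substitute character ('?' unless configured otherwise). What counts as invalid depends on the PHP version, so treat this as a first pass only.

```php
<?php
// Hypothetical helper: returns $str unchanged when mbstring considers it
// valid UTF-8, otherwise lets mbstring replace the invalid byte sequences.
function ensure_utf8($str)
{
    if (mb_check_encoding($str, 'UTF-8')) {
        return $str;
    }
    return mb_convert_encoding($str, 'UTF-8', 'UTF-8');
}

// "\xFF" can never appear in UTF-8, so this input needs repairing.
$clean = ensure_utf8("caf\xC3\xA9 \xFF");
var_dump(mb_check_encoding($clean, 'UTF-8'));
```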

My problem now is with the handling of invalid chars. Apparently, preg_match() with the u modifier chokes when the input contains Unicode surrogates. I’ve found a regexp on the W3C website that doesn’t require the u modifier, but it’s quite complicated and preg_match_all() chokes on relatively small inputs (a few kilobytes), while I need to process inputs closer to a megabyte.

So now I’m back to square one: is there any (PECL or not) extension that sanitizes UTF-8 that I wouldn’t know about?

Quick recap of what I’ve tried, and why it didn’t work:

[list]
[*]preg_replace() - chokes on Unicode surrogates
[*]anything based on preg_match_all() - segfaults on large input
[*]Normalizer::normalize() - returns NULL on Unicode surrogates
[*]mb_convert_encoding() - leaves Unicode surrogates untouched
[*]iconv() - same as mb_convert_encoding()
[/list]

Thanks for any info :slight_smile:

Well, the answer literally came to me in my sleep. As I was about to take a nap I realized that I could eliminate surrogates with a simple (non-Unicode-aware) regexp:

$str = preg_replace('#\xED[\xA0-\xBF][\x80-\xBF]#', '', $str);

…all thanks to the judiciously-chosen character ranges from the Unicode guys.
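
The trick works because U+D800 through U+DFFF would encode as the bytes ED A0 80 through ED BF BF, a range that no valid UTF-8 sequence uses. Here is a sketch of a two-step cleanup built on it, with hypothetical function names: strip the encoded surrogates byte-wise first, then use the u modifier to drop everything outside the XML 1.0 Char production. It assumes the input is otherwise valid UTF-8 and only covers invalid chars, not the merely "discouraged" ones.

```php
<?php
// Remove UTF-8-encoded surrogates (U+D800..U+DFFF) without the u modifier.
function strip_surrogates($str)
{
    return preg_replace('#\xED[\xA0-\xBF][\x80-\xBF]#', '', $str);
}

// Remove code points that XML 1.0 does not allow:
// Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
// Returns NULL if $str is still not valid UTF-8 after the surrogate pass.
function strip_invalid_xml_chars($str)
{
    $str = strip_surrogates($str);
    return preg_replace(
        '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
        '',
        $str
    );
}

// NUL and vertical tab are not XML chars; the newline is.
var_dump(strip_invalid_xml_chars("a\x00b\x0Bc\n") === "abc\n");
```

Doing the surrogate pass first means the second, Unicode-aware pattern never sees the sequences that made the u modifier choke.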

Any info about UTF-8 sanitization with PHP extensions is still welcome, though :wink:

Use the e (expression) modifier in your preg_replace(), so that when PCRE hits a surrogate (a character that does fall into the current plane) it just returns an empty string for that single character. That’s quite different from using the u (Unicode) modifier alone: with the e modifier the replacement is evaluated again for each match found, whereas with the u modifier all matches are handled at once, so if one of those matches is a surrogate, preg_replace() returns an empty string for the complete string. Yes, using the e modifier is more expensive than just using the u modifier, but the e modifier avoids the problems that the u modifier cannot!

Unicode to UTF-8: handles \u(code) or %u(code).

I fixed the end of the decode_unicode() function so that it returns nothing for an invalid character.


<?php


if ( ! function_exists ( 'unicode_decode' ) )
{
	function unicode_decode ( $string, $charset = '' )
	{
		return preg_replace ( '#(?:\\\\|%)u([0-9a-f]{4})#e', "decode_unicode('\\1')", $string );
	}

	function decode_unicode ( $c )
	{
		$c = hexdec ( $c );

		return ( $c < 0x80 ? chr ( $c ) : ( $c < 0x800 ? chr ( 0xc0 | ( $c >> 6 ) ) . chr ( 0x80 | ( $c & 0x3f ) ) : ( $c < 0x10000 ? chr ( 0xe0 | ( $c >> 12 ) ) . chr ( 0x80 | ( ( $c >> 6 ) & 0x3f ) ) . chr ( 0x80 | ( $c & 0x3f ) ) : ( $c < 0x200000 ? chr ( 0xf0 | ( $c >> 18 ) ) . chr ( 0x80 | ( ( $c >> 12 ) & 0x3f ) ) . chr ( 0x80 | ( ( $c >> 6 ) & 0x3f ) ) . chr ( 0x80 | ( $c & 0x3f ) ) : '' ) ) ) );
	}
}

$str = '<div>%u786e%u4fdd%u6d4f%u89c8%u5668%u7684%u5730%u5740%u680f%u4e2d%u663e%u793a%u7684%u7f51%u7ad9%u5730%u5740%u7684%u62fc%u5199%u548c%u683c%u5f0f%u6b63%u786e%u65e0%u8bef%u3002
\\u5982\\u679c\\u901a\\u8fc7\\u5355\\u51fb\\u94fe\\u63a5\\u800c\\u5230\\u8fbe\\u4e86\\u8be5\\u7f51\\u9875\\uff0c\\u8bf7\\u4e0e\\u7f51\\u7ad9\\u7ba1\\u7406\\u5458\\u8054\\u7cfb\\uff0c\\u901a\\u77e5\\u4ed6\\u4eec\\u8be5\\u94fe\\u63a5\\u7684\\u683c\\u5f0f\\u4e0d\\u6b63\\u786e\\u3002


This is the end of the word wrap test. Just a little more               text to go over the maximum          line length so you can see how it works.


</div>';

echo unicode_decode ( $str );


?>


You should just use preg_replace_callback() instead of the "[URL="http://us2.php.net/manual/en/reference.pcre.pattern.modifiers.php"]e (PREG_REPLACE_EVAL)[/URL]" modifier, printf.


function unicode_decode ( $string )
{
    function decode_unicode ( $c )
    {
        $c = hexdec( $c[1] );
        return ( $c < 0x80 ? chr ( $c )
             : ( $c < 0x800 ? chr ( 0xc0 | ( $c >> 6 ) ) . chr ( 0x80 | ( $c & 0x3f ) )
             : ( $c < 0x10000 ? chr ( 0xe0 | ( $c >> 12 ) ) . chr ( 0x80 | ( ( $c >> 6 ) & 0x3f ) ) . chr ( 0x80 | ( $c & 0x3f ) )
             : ( $c < 0x200000 ? chr ( 0xf0 | ( $c >> 18 ) ) . chr ( 0x80 | ( ( $c >> 12 ) & 0x3f ) ) . chr ( 0x80 | ( ( $c >> 6 ) & 0x3f ) ) . chr ( 0x80 | ( $c & 0x3f ) )
             : '' ) ) ) );
    }
    return preg_replace_callback( '#(?:\\\\|%)u([0-9a-fA-F]{4})#', 'decode_unicode', $string );
}

$str = '<div>\\u786e\\u4fdd\\u6d4f\\u89c8\\u5668\\u7684\\u5730\\u5740\\u680f\\u4e2d\\u663e\\u793a\\u7684\\u7f51\\u7ad9\\u5730\\u5740\\u7684\\u62fc\\u5199\\u548c\\u683c\\u5f0f\\u6b63\\u786e\\u65e0\\u8bef\\u3002\\u5982\\u679c\\u901a\\u8fc7\\u5355\\u51fb\\u94fe\\u63a5\\u800c\\u5230\\u8fbe\\u4e86\\u8be5\\u7f51\\u9875\\uff0c\\u8bf7\\u4e0e\\u7f51\\u7ad9\\u7ba1\\u7406\\u5458\\u8054\\u7cfb\\uff0c\\u901a\\u77e5\\u4ed6\\u4eec\\u8be5\\u94fe\\u63a5\\u7684\\u683c\\u5f0f\\u4e0d\\u6b63\\u786e\\u3002This is the end of the word wrap test. Just a little more text to go over the maximum line length so you can see how it works.</div>';

print unicode_decode ( $str );

[ot]A function inside a function, Logic_Earth?

I didn’t think that was possible.[/ot]

[ot]Very possible, one of those little secrets of PHP. Should be aware of this, however:


function test () {
    function another_test() { return __FUNCTION__; }
    return __FUNCTION__;
}

if ( function_exists( 'another_test' ) ) print 'another_test exists';
print test();
if ( function_exists( 'another_test' ) ) print 'another_test exists';

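You will only get the last “another_test exists”, because another_test() is not defined until test() has been called :slight_smile: Calling test() a second time is a fatal error, since another_test() would then be declared twice.
Also, if I were using PHP 5.3 features I would have used a closure, which would keep the nested function contained in the other function.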
[/ot]
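
For reference, a sketch of what that closure-based variant might look like on PHP 5.3 or later. This is not the code posted above, just the same decoding logic with the callback written as an anonymous function, so nothing gets declared globally and the function can be called more than once.

```php
<?php
// The callback is an anonymous function, so no second named function is
// declared and repeated calls to unicode_decode() are safe.
function unicode_decode($string)
{
    return preg_replace_callback(
        '#(?:\\\\|%)u([0-9a-fA-F]{4})#',
        function ($m) {
            $c = hexdec($m[1]);
            if ($c < 0x80) {
                return chr($c);
            }
            if ($c < 0x800) {
                return chr(0xc0 | ($c >> 6)) . chr(0x80 | ($c & 0x3f));
            }
            // Four hex digits never exceed 0xFFFF, so three bytes suffice.
            return chr(0xe0 | ($c >> 12))
                 . chr(0x80 | (($c >> 6) & 0x3f))
                 . chr(0x80 | ($c & 0x3f));
        },
        $string
    );
}

echo unicode_decode('\u786e%u4fdd'), "\n";
echo unicode_decode('%u6d4f'), "\n"; // a second call does not redeclare anything
```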

Off Topic:

Cheers logic_earth, I’m off for a play!

You are definitely right logic_earth, _callback is so much faster…

Here is an update…

This handles every conceivable way that a browser may pass Unicode characters by GET or POST and converts valid characters to UTF-8. It fixes numeric entities that may have octal representations, too! Bogus characters are replaced with '' (PHP defaults to '?'); if you want to change that, look at the end of decode_unicode() and insert the character(s) you want to return for a bogus character!



function unicode_decode ( $string )
{
    function decode_unicode ( $c )
    {
		return ( ( $c = ( isset ( $c[3] ) ? $c[3]
				:
				( isset ( $c[2] ) ? hexdec ( $c[2] )
				:
				hexdec ( $c[1] ) ) ) ) < 0x80 ? chr ( $c )
				:
				( $c < 0x800 ? chr ( 0xc0 | ( $c >> 6 ) ) . chr ( 0x80 | ( $c & 0x3f ) )
				:
				( $c < 0x10000 ? chr ( 0xe0 | ( $c >> 12 ) ) . chr ( 0x80 | ( ( $c >> 6 ) & 0x3f ) ) . chr ( 0x80 | ( $c & 0x3f ) )
				:
				( $c < 0x200000 ? chr ( 0xf0 | ( $c >> 18 ) ) . chr ( 0x80 | ( ( $c >> 12 ) & 0x3f ) ) . chr ( 0x80 | ( ( $c >> 6 ) & 0x3f ) ) . chr ( 0x80 | ( $c & 0x3f ) ) : '' ) ) )
			);
    }

	return preg_replace_callback ( '~(?:\\\\u|%u)([0-9a-f]{4})|&#x0*([0-9a-f]+);|&#0*([0-9]+);~i', 'decode_unicode', $string );
}
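
One gap worth noting: an escape or numeric entity that names a surrogate, such as \ud800 or &#xD800;, would still be turned into bytes that are not valid UTF-8, which brings back the surrogate problem from the start of the thread. Below is a hedged sketch of a stricter encoder that could be called from the callback; codepoint_to_utf8() is a hypothetical name and is not part of the code above.

```php
<?php
// Hypothetical helper: encode one code point as UTF-8, returning '' for
// surrogates (U+D800..U+DFFF) and for anything outside U+0000..U+10FFFF.
function codepoint_to_utf8($c)
{
    $c = (int) $c;
    if ($c < 0 || $c > 0x10FFFF || ($c >= 0xD800 && $c <= 0xDFFF)) {
        return '';
    }
    if ($c < 0x80) {
        return chr($c);
    }
    if ($c < 0x800) {
        return chr(0xc0 | ($c >> 6)) . chr(0x80 | ($c & 0x3f));
    }
    if ($c < 0x10000) {
        return chr(0xe0 | ($c >> 12))
             . chr(0x80 | (($c >> 6) & 0x3f))
             . chr(0x80 | ($c & 0x3f));
    }
    return chr(0xf0 | ($c >> 18))
         . chr(0x80 | (($c >> 12) & 0x3f))
         . chr(0x80 | (($c >> 6) & 0x3f))
         . chr(0x80 | ($c & 0x3f));
}

var_dump(codepoint_to_utf8(0x20AC) === "\xE2\x82\xAC"); // euro sign
var_dump(codepoint_to_utf8(0xD800) === '');             // surrogate rejected
```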