How to sanitize UTF-8 input efficiently?

Josh_Davis · February 28, 2009, 8:34am

Hello,

I’m searching for ways to sanitize UTF-8 input on PHP 5.3. What I’d like to do is make sure that the input is valid UTF-8, then remove invalid (or simply “discouraged”) XML chars.

So far, I’ve found that mb_convert_encoding() could fix invalid UTF-8 by running

$str = mb_convert_encoding($str, 'utf-8', 'utf-8');
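
For reference, a minimal sketch of that validation step, assuming the mbstring extension is loaded. The helper name force_valid_utf8 is hypothetical, and exactly what gets substituted for bad bytes depends on the PHP version and on mb_substitute_character().

```php
<?php
// Hypothetical helper name; not from the thread.
function force_valid_utf8($str)
{
    // Fast path: well-formed input is returned unchanged.
    if (mb_check_encoding($str, 'UTF-8')) {
        return $str;
    }
    // Converting UTF-8 to UTF-8 rewrites ill-formed sequences using the
    // configured substitute character ('?' by default).
    return mb_convert_encoding($str, 'UTF-8', 'UTF-8');
}

$fixed = force_valid_utf8("abc\xFFdef");        // 0xFF can never occur in UTF-8
var_dump(mb_check_encoding($fixed, 'UTF-8'));
```

Checking first avoids copying a large, already-valid string through the converter.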

My problem now is with the handling of invalid chars. Apparently, preg_match() with the u modifier chokes when the input contains Unicode surrogates. I’ve found a regexp that doesn’t require the u modifier at the W3C website, but it’s quite complicated and preg_match_all() chokes on relatively small inputs (a few kilobytes), while I need to process inputs closer to a megabyte.

So now I’m back to square one: is there any (PECL or not) extension that sanitizes UTF-8 that I wouldn’t know about?

Quick recap of what I’ve tried, and why it didn’t work:

* preg_replace() - chokes on Unicode surrogates
* anything based on preg_match_all() - segfaults on large input
* Normalizer::normalize() - returns NULL on Unicode surrogates
* mb_convert_encoding() - leaves Unicode surrogates untouched
* iconv() - same as mb_convert_encoding()

Thanks for any info

Josh_Davis · February 28, 2009, 11:45am

Well, the answer literally came to me in my sleep. As I was about to take a nap, I realized that I could eliminate surrogates with a simple (non-Unicode-aware) regexp:

$str = preg_replace('#\xED[\xA0-\xBF][\x80-\xBF]#', '', $str);

…all thanks to the judiciously-chosen character ranges from the Unicode guys.
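
To see why the byte ranges work, here is a small sketch. The wrapper name strip_surrogates is hypothetical; the pattern is the one quoted above. Surrogates U+D800..U+DFFF encode as ED A0 80 through ED BF BF, while the legal neighbours U+D7FF (ED 9F BF) and U+E000 (EE 80 80) fall outside those ranges.

```php
<?php
// Hypothetical wrapper name; the pattern itself comes from the post above.
function strip_surrogates($str)
{
    // Byte-level match, deliberately without the u modifier.
    return preg_replace('#\xED[\xA0-\xBF][\x80-\xBF]#', '', $str);
}

$dirty = "a\xED\xA0\x80b\xED\xBF\xBFc";   // U+D800 and U+DFFF between letters
$legal = "\xED\x9F\xBF\xEE\x80\x80";      // U+D7FF and U+E000, both allowed

var_dump(strip_surrogates($dirty) === 'abc');
var_dump(strip_surrogates($legal) === $legal);
```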

Any info about UTF-8 sanitization with PHP extensions is still welcome, though
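
The same byte-level idea can probably be stretched to the other code points XML 1.0 forbids. This is a sketch only: strip_xml10_invalid is a hypothetical name, it assumes the input is already well-formed UTF-8 apart from surrogates, and it does not touch the merely "discouraged" ranges.

```php
<?php
// Hypothetical helper; assumes $str is otherwise well-formed UTF-8.
function strip_xml10_invalid($str)
{
    $pattern = '/[\x00-\x08\x0B\x0C\x0E-\x1F]'   // C0 controls except TAB, LF, CR
             . '|\xED[\xA0-\xBF][\x80-\xBF]'     // surrogates U+D800..U+DFFF
             . '|\xEF\xBF[\xBE\xBF]/';           // U+FFFE and U+FFFF
    // No u modifier, so PCRE never validates the subject as UTF-8.
    return preg_replace($pattern, '', $str);
}

$in = "a\x00b\tc\xEF\xBF\xBEd\xED\xA0\x80e";
var_dump(strip_xml10_invalid($in) === "ab\tcde");
```

Matching only the forbidden sequences keeps the pattern short, which may help with inputs near a megabyte, though that has not been benchmarked here.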

printf · February 28, 2009, 12:22pm

Use the expression (e) modifier in your preg_replace() so that when PCRE hits a surrogate (a character that does fall into the current plane) it just returns an empty string for that (single) character. That’s quite different from just using the Unicode (u) modifier, because with the expression modifier the replacement is recompiled after each match is found, whereas with the Unicode modifier all matches are returned at once, so if one of those matches is a surrogate, preg_replace() will return an empty string for the complete string. Yes, using the e modifier is more expensive than just using the u modifier, but the e modifier avoids the problems that the u modifier cannot!

This Unicode-to-UTF-8 converter handles \u(code) or %u(code).

I fixed the end of the function decode_unicode so it returns nothing for an invalid character.


<?php


if ( ! function_exists ( 'unicode_decode' ) )
{
	function unicode_decode ( $string, $charset = '' )
	{
		// Matches \uXXXX or %uXXXX. Note: the e modifier was deprecated in PHP 5.5 and removed in PHP 7.0.
		return preg_replace ( '#(?:\\\\|%)u([0-9a-f]{4})#e', "decode_unicode('\\1')", $string );
	}

	function decode_unicode ( $c )
	{
		$c = hexdec ( $c );

		return ( $c < 0x80 ? chr ( $c ) : ( $c < 0x800 ? chr ( 0xc0 | ( $c >> 6 ) ) . chr ( 0x80 | ( $c & 0x3f ) ) : ( $c < 0x10000 ? chr ( 0xe0 | ( $c >> 12 ) ) . chr ( 0x80 | ( ( $c >> 6 ) & 0x3f ) ) . chr ( 0x80 | ( $c & 0x3f ) ) : ( $c < 0x200000 ? chr ( 0xf0 | ( $c >> 18 ) ) . chr ( 0x80 | ( ( $c >> 12 ) & 0x3f ) ) . chr ( 0x80 | ( ( $c >> 6 ) & 0x3f ) ) . chr ( 0x80 | ( $c & 0x3f ) ) : '' ) ) ) );
	}
}

$str = '<div>%u786e%u4fdd%u6d4f%u89c8%u5668%u7684%u5730%u5740%u680f%u4e2d%u663e%u793a%u7684%u7f51%u7ad9%u5730%u5740%u7684%u62fc%u5199%u548c%u683c%u5f0f%u6b63%u786e%u65e0%u8bef%u3002
\\u5982\\u679c\\u901a\\u8fc7\\u5355\\u51fb\\u94fe\\u63a5\\u800c\\u5230\\u8fbe\\u4e86\\u8be5\\u7f51\\u9875\\uff0c\\u8bf7\\u4e0e\\u7f51\\u7ad9\\u7ba1\\u7406\\u5458\\u8054\\u7cfb\\uff0c\\u901a\\u77e5\\u4ed6\\u4eec\\u8be5\\u94fe\\u63a5\\u7684\\u683c\\u5f0f\\u4e0d\\u6b63\\u786e\\u3002


This is the end of the word wrap test. Just a little more               text to go over the maximum          line length so you can see how it works.


</div>';

echo unicode_decode ( $str );


?>

logic_earth · February 28, 2009, 6:39pm

Should just use preg_replace_callback() instead of the “e” (PREG_REPLACE_EVAL) modifier (http://us2.php.net/manual/en/reference.pcre.pattern.modifiers.php), printf.


function unicode_decode ( $string )
{
    function decode_unicode ( $c )
    {
        $c = hexdec( $c[1] );
        return ( $c < 0x80 ? chr ( $c )
             : ( $c < 0x800 ? chr ( 0xc0 | ( $c >> 6 ) ) . chr ( 0x80 | ( $c & 0x3f ) )
             : ( $c < 0x10000 ? chr ( 0xe0 | ( $c >> 12 ) ) . chr ( 0x80 | ( ( $c >> 6 ) & 0x3f ) ) . chr ( 0x80 | ( $c & 0x3f ) )
             : ( $c < 0x200000 ? chr ( 0xf0 | ( $c >> 18 ) ) . chr ( 0x80 | ( ( $c >> 12 ) & 0x3f ) ) . chr ( 0x80 | ( ( $c >> 6 ) & 0x3f ) ) . chr ( 0x80 | ( $c & 0x3f ) )
             : '' ) ) ) );
    }
    return preg_replace_callback( '#(?:\\\\|%)u([0-9a-fA-F]{4})#', 'decode_unicode', $string );
}

$str = '<div>\\u786e\\u4fdd\\u6d4f\\u89c8\\u5668\\u7684\\u5730\\u5740\\u680f\\u4e2d\\u663e\\u793a\\u7684\\u7f51\\u7ad9\\u5730\\u5740\\u7684\\u62fc\\u5199\\u548c\\u683c\\u5f0f\\u6b63\\u786e\\u65e0\\u8bef\\u3002\\u5982\\u679c\\u901a\\u8fc7\\u5355\\u51fb\\u94fe\\u63a5\\u800c\\u5230\\u8fbe\\u4e86\\u8be5\\u7f51\\u9875\\uff0c\\u8bf7\\u4e0e\\u7f51\\u7ad9\\u7ba1\\u7406\\u5458\\u8054\\u7cfb\\uff0c\\u901a\\u77e5\\u4ed6\\u4eec\\u8be5\\u94fe\\u63a5\\u7684\\u683c\\u5f0f\\u4e0d\\u6b63\\u786e\\u3002This is the end of the word wrap test. Just a little more text to go over the maximum line length so you can see how it works.</div>';

print unicode_decode ( $str );

AnthonySterling · February 28, 2009, 6:43pm

Off Topic:

A function inside a function, logic_earth?

I didn’t think that was possible.

logic_earth · February 28, 2009, 6:49pm

Off Topic:

Very possible, one of those little secrets of PHP. You should be aware of this, however:


function test () {
    function another_test() { return __FUNCTION__; }
    return __FUNCTION__;
}

if ( function_exists( 'another_test' ) ) print 'another_test exists';
print test();
if ( function_exists( 'another_test' ) ) print 'another_test exists';

Only the last check prints “another_test exists”, because another_test() is not defined until test() has run. For the same reason, calling test() a second time fails with a “cannot redeclare” fatal error.
Also, if I were using PHP 5.3 features I would have used a closure, which would keep the nested function contained in the other function.
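
A rough sketch of what that closure variant might look like on PHP 5.3 or later. The name unicode_decode_closure is hypothetical, and dropping lone surrogates is an added assumption rather than something the code above does.

```php
<?php
// Sketch only: unicode_decode_closure is a hypothetical name, not from the thread.
function unicode_decode_closure($string)
{
    // The callback lives in a local variable, so nothing is added to the
    // global function table and repeated calls cannot cause a redeclare error.
    $decode = function (array $m) {
        $c = hexdec($m[1]);                // four hex digits, so $c <= 0xFFFF
        if ($c < 0x80) {
            return chr($c);
        }
        if ($c < 0x800) {
            return chr(0xC0 | ($c >> 6)) . chr(0x80 | ($c & 0x3F));
        }
        if ($c >= 0xD800 && $c <= 0xDFFF) {
            return '';                     // assumption: drop lone surrogates
        }
        return chr(0xE0 | ($c >> 12))
             . chr(0x80 | (($c >> 6) & 0x3F))
             . chr(0x80 | ($c & 0x3F));
    };

    // Matches \uXXXX or %uXXXX.
    return preg_replace_callback('#(?:\\\\|%)u([0-9a-fA-F]{4})#', $decode, $string);
}

echo unicode_decode_closure('%u786e\u4fdd'), "\n";
echo unicode_decode_closure('%u786e\u4fdd'), "\n";   // a second call is safe
```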

AnthonySterling · February 28, 2009, 6:57pm

Off Topic:

Cheers logic_earth, I’m off for a play!

printf · March 1, 2009, 12:50pm

You are definitely right logic_earth, _callback is so much faster…

Here is an update…

This handles the common ways a browser may pass Unicode characters by GET or POST (\uXXXX, %uXXXX, and hexadecimal or decimal numeric entities) and converts valid characters to UTF-8. Numeric entities written with leading zeros, which can look like octal, are handled too! Bogus characters are replaced with ‘’ (PHP defaults to ‘?’); if you want to change that, look at the end of decode_unicode and insert the character(s) you want to return for a bogus character!



function unicode_decode ( $string )
{
    function decode_unicode ( $c )
    {
		return ( ( $c = ( isset ( $c[3] ) ? $c[3]
				:
				( isset ( $c[2] ) ? hexdec ( $c[2] )
				:
				hexdec ( $c[1] ) ) ) ) < 0x80 ? chr ( $c )
				:
				( $c < 0x800 ? chr ( 0xc0 | ( $c >> 6 ) ) . chr ( 0x80 | ( $c & 0x3f ) )
				:
				( $c < 0x10000 ? chr ( 0xe0 | ( $c >> 12 ) ) . chr ( 0x80 | ( ( $c >> 6 ) & 0x3f ) ) . chr ( 0x80 | ( $c & 0x3f ) )
				:
				( $c < 0x200000 ? chr ( 0xf0 | ( $c >> 18 ) ) . chr ( 0x80 | ( ( $c >> 12 ) & 0x3f ) ) . chr ( 0x80 | ( ( $c >> 6 ) & 0x3f ) ) . chr ( 0x80 | ( $c & 0x3f ) ) : '' ) ) )
			);
    }

	return preg_replace_callback ( '~(?:\\\\u|%u)([0-9a-f]{4})|&#x0*([0-9a-f]+);|&#0*([0-9]+);~i', 'decode_unicode', $string );
}
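
For anyone who finds the nested ternary hard to follow, here is one possible restatement with named capture groups. It is a sketch under assumptions, not a drop-in replacement: codepoint_to_utf8 and decode_escapes are hypothetical names, and unlike the function above it returns '' for surrogates and for anything beyond U+10FFFF.

```php
<?php
// Hypothetical helper: encode one code point as UTF-8, or '' if it is not encodable.
function codepoint_to_utf8($cp)
{
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp >= 0xD800 && $cp <= 0xDFFF) {
        return '';                                   // surrogates are never valid
    }
    if ($cp < 0x10000) {
        return chr(0xE0 | ($cp >> 12)) . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp <= 0x10FFFF) {
        return chr(0xF0 | ($cp >> 18)) . chr(0x80 | (($cp >> 12) & 0x3F))
             . chr(0x80 | (($cp >> 6) & 0x3F)) . chr(0x80 | ($cp & 0x3F));
    }
    return '';                                       // beyond the Unicode range
}

// Hypothetical name; handles \uXXXX, %uXXXX, &#xHEX; and &#DEC; forms.
function decode_escapes($string)
{
    $pattern = '~(?:\\\\u|%u)(?<u>[0-9a-f]{4})|&#x0*(?<hex>[0-9a-f]+);|&#0*(?<dec>[0-9]+);~i';

    return preg_replace_callback($pattern, function ($m) {
        if (isset($m['dec']) && $m['dec'] !== '') {
            $cp = $m['dec'] + 0;                     // decimal entity
        } elseif (isset($m['hex']) && $m['hex'] !== '') {
            $cp = hexdec($m['hex']);                 // hexadecimal entity
        } else {
            $cp = hexdec($m['u']);                   // \uXXXX or %uXXXX
        }
        return codepoint_to_utf8($cp);
    }, $string);
}

echo decode_escapes('caf\u00e9 caf%u00E9 caf&#xe9; caf&#0233;'), "\n";
```

Checking the groups by name avoids relying on which numbered groups happen to be set for each alternative.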
