How to remove numeric escaped strings

Hello all,

I am using SimplePie to read the RSS feed of a blog which is using UTF-8 (I assume).

When I fetch the feed I get back this type of stuff:

Brand X \x96 people cannot be wrong!

The \x96 looks like some kind of numeric escape (an NCR?). Anyway, on the blog it's a long dash, like the kind you get in Word documents.

I have set the following in SimplePie, but I still cannot remove those \x’s!

$this->simplepie->set_output_encoding('ISO-8859-1');
$this->simplepie->strip_htmltags(array('base', 'blink', 'body', 'doctype', 'embed', 'font', 'form', 'frame', 'frameset', 'html', 'iframe', 'input', 'marquee', 'meta', 'noscript', 'object', 'param', 'script', 'style'));

Is there a magical regex that I can use in PHP?

I tried this one, but it did not work:

print preg_replace("#(\\\x[0-9A-F]{2})#e", "", $string);

Cheers

Marc

Do not strip or replace them. Instead, ensure that you are able to process and display text as UTF-8, and you will have no further troubles.
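A minimal sketch of what "display text as UTF-8" could look like on the receiving page. The function name `render_utf8_page` is hypothetical and not part of SimplePie; the point is only that the HTTP header, the `meta` tag, and the escaping call all name the same encoding.

```php
<?php
// Hypothetical helper: wraps already-UTF-8 text in a minimal page declared as UTF-8.
function render_utf8_page(string $text): string {
    // Escape markup characters and tell htmlspecialchars the input is UTF-8.
    $safe = htmlspecialchars($text, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
    return "<!DOCTYPE html>\n<html><head><meta charset=\"utf-8\"></head>"
         . "<body><p>{$safe}</p></body></html>";
}

// Declare the encoding in the HTTP header as well, if headers can still be sent.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

// "\u{2013}" is the en dash written as a PHP 7+ Unicode escape.
echo render_utf8_page("Brand X \u{2013} people cannot be wrong!");
```

This only helps on pages you control, which is the limitation raised in the next reply.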

Hello pmw57,

It's a bummer, but I cannot control where these items are being posted to. It could be Twitter, Facebook, or another blog not using UTF-8.

Cheers

Marc

Then give utf8_decode a try, which decodes from UTF-8 to ISO-8859-1.

Hello. Yes, I tried that and it did not work on that code:

print utf8_decode('Brand X \x96 people cannot be wrong!');
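One possible reason this call changes nothing: inside single quotes PHP does not interpret `\x96`, so the string holds the four plain ASCII characters backslash, x, 9, 6. And if the feed really delivers the single byte 0x96, that byte is the en dash in Windows-1252 and is not valid UTF-8 on its own, so a UTF-8 decoder has nothing sensible to decode. A minimal sketch, assuming the mbstring extension is installed and that the source bytes are Windows-1252 (an assumption, not something confirmed for this feed):

```php
<?php
$literal = 'Brand X \x96 people';   // single quotes: backslash, x, 9, 6 stay as four characters
$byte    = "Brand X \x96 people";   // double quotes: \x96 becomes the single byte 0x96

var_dump(strlen($literal) - strlen($byte));   // int(3)
var_dump(mb_check_encoding($byte, 'UTF-8'));  // bool(false): a lone 0x96 is not valid UTF-8

// Assumption: the byte is Windows-1252, where 0x96 is the en dash (U+2013).
$utf8 = mb_convert_encoding($byte, 'UTF-8', 'Windows-1252');
var_dump($utf8 === "Brand X \u{2013} people"); // bool(true)
```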

Where is the feed for this blog, so that we can work together with the same information to resolve this issue?

Here ya go: http://businessblogs.co.nz/feed

And thank you for the help!

Cheers

Marc

After some testing, I found that the same utf8_decode page has a user-submitted function called utf2html.

Here is how to use it.


// $rss is a UTF-8 string
utf2html($rss);
// $rss now contains only ASCII plus numeric entities, so it is okay as ISO-8859-1

This is the test code. The file is saved as UTF-8 with no BOM, and the text for $rss is copied directly from the RSS page as UTF-8 text.


<?php
header ('Content-Type: text/html;charset=iso-8859-1');
$rss = 'Brand AC/DC – 35,000 people cannot be wrong!';

function utf2html (&$str) {
    // Replaces 2- and 3-byte UTF-8 sequences in $str with numeric entities, in place.
    $ret = "";
    $max = strlen($str);
    $last = 0;  // index of the first byte not yet copied to $ret
    for ($i=0; $i<$max; $i++) {
        $c = $str[$i]; // square brackets: the {} offset syntax was removed in PHP 8
        $c1 = ord($c);
        if ($c1>>5 == 6) {  // 110x xxxx, 110 prefix for 2 bytes unicode
            $ret .= substr($str, $last, $i-$last); // append all the regular characters we've passed
            $c1 &= 31; // remove the 3 bit two bytes prefix
            $c2 = ord($str[++$i]); // the next byte
            $c2 &= 63;  // remove the 2 bit trailing byte prefix
            $c2 |= (($c1 & 3) << 6); // last 2 bits of c1 become first 2 of c2
            $c1 >>= 2; // c1 shifts 2 to the right
            $ret .= "&#" . ($c1 * 0x100 + $c2) . ";"; // emit the code point as a decimal entity
            $last = $i+1;
        }
        elseif ($c1>>4 == 14) {  // 1110 xxxx, 1110 prefix for 3 bytes unicode
            $ret .= substr($str, $last, $i-$last); // append all the regular characters we've passed
            $c2 = ord($str[++$i]); // the next byte
            $c3 = ord($str[++$i]); // the third byte
            $c1 &= 15; // remove the 4 bit three bytes prefix
            $c2 &= 63;  // remove the 2 bit trailing byte prefix
            $c3 &= 63;  // remove the 2 bit trailing byte prefix
            $c3 |= (($c2 & 3) << 6); // last 2 bits of c2 become first 2 of c3
            $c2 >>=2; //c2 shifts 2 to the right
            $c2 |= (($c1 & 15) << 4); // last 4 bits of c1 become first 4 of c2
            $c1 >>= 4; // c1 shifts 4 to the right
            $ret .= '&#' . (($c1 * 0x10000) + ($c2 * 0x100) + $c3) . ';'; // emit the code point as a decimal entity
            $last = $i+1;
        }
    }
    $str = $ret . substr($str, $last); // append the last batch of regular characters
}

$utf8html = $rss;

utf2html($rss);
$iso8859_1html = $rss;

echo 'Page is encoded as ISO-8859-1<br>';
echo 'UTF-8 text is ' . $utf8html . '<br>';
echo 'ISO-8859-1 text is ' . $iso8859_1html . '<br>';
?>

The output is:


Page is encoded as ISO-8859-1
Brand AC/DC &#226;&#8364;&#8220; 35,000 people cannot be wrong!
Brand AC/DC &#8211; 35,000 people cannot be wrong!
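For what it's worth, the entity in the last line follows directly from the UTF-8 bytes of the en dash, E2 80 93: stripping the prefix bits leaves 0010 000000 010011, which is 0x2013, or 8211 in decimal. On installations with the mbstring extension (an assumption here), `mb_encode_numericentity` should give the same result as the hand-written function. The variable names below are illustrative only.

```php
<?php
$rss = "Brand AC/DC \u{2013} 35,000 people cannot be wrong!"; // en dash as UTF-8 bytes E2 80 93

// Worked arithmetic for the three-byte sequence E2 80 93:
// keep 4 bits of the lead byte and 6 bits of each continuation byte.
$codepoint = ((0xE2 & 0x0F) << 12) | ((0x80 & 0x3F) << 6) | (0x93 & 0x3F);
echo $codepoint, "\n"; // 8211

// convmap is [first code point, last code point, offset, mask];
// this one covers every code point outside ASCII.
$convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
$html = mb_encode_numericentity($rss, $convmap, 'UTF-8');
echo $html, "\n"; // Brand AC/DC &#8211; 35,000 people cannot be wrong!
```

Unlike the function above, this call is not limited to 2- and 3-byte sequences, since the map extends to U+10FFFF.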

Thanks for all the hard work! I will have a look at this now.

When you think about it, would it not be a simple regex call to remove the \x<whatever>, or am I missing something about the encodings? …

Removing is destroying the information that was there.

The above utf2html converts them where possible, so that the dashes, quotes and other characters actually remain visible.
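As a rough sketch of converting rather than removing: if the string really contains the literal characters `\x96` (backslash, x, two hex digits), a callback can turn each escape back into its byte, and the result can then be transcoded to UTF-8. The function name `unescape_hex` is hypothetical, treating the bytes as Windows-1252 is an assumption about this feed, and mbstring is assumed to be installed.

```php
<?php
// Hypothetical helper: turns literal \xHH escapes into raw bytes, then converts
// the result from Windows-1252 (assumed source encoding) to UTF-8.
function unescape_hex(string $text): string {
    $bytes = preg_replace_callback(
        '/\\\\x([0-9A-Fa-f]{2})/',          // a literal backslash, "x", two hex digits
        static function (array $m): string {
            return chr(hexdec($m[1]));      // e.g. "96" becomes the byte 0x96
        },
        $text
    );
    return mb_convert_encoding($bytes, 'UTF-8', 'Windows-1252');
}

$fixed = unescape_hex('Brand X \x96 people cannot be wrong!');
echo bin2hex(substr($fixed, 8, 3)), "\n"; // e28093, the UTF-8 bytes of the en dash
```

This only makes sense when the rest of the string is plain ASCII or Windows-1252; running it on text that is already UTF-8 would double-encode any multi-byte characters.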