Hello all,
I am using SimplePie to read the RSS feed of a blog which is using UTF-8 (I assume).
When I fetch the feed I get back this type of stuff:
Brand X \x96 people cannot be wrong!
The \x96 appears to be some kind of escaped character (a numeric character reference?). Anyway, on the blog it's a long dash, like the kind you get in Word documents.
I have set the following in SimplePie but I still cannot remove those \x sequences!
$this->simplepie->set_output_encoding('ISO-8859-1');
$this->simplepie->strip_htmltags(array('base', 'blink', 'body', 'doctype', 'embed', 'font', 'form', 'frame', 'frameset', 'html', 'iframe', 'input', 'marquee', 'meta', 'noscript', 'object', 'param', 'script', 'style'));
Is there a magical regex that I can use in PHP?
I tried this one but it did not work:
print preg_replace("#(\\\x[0-9A-F]{2})#e", "", $string);
Cheers
Marc
Do not strip or replace them. Instead, ensure that you are able to process and display text as UTF-8 and you will have no further troubles.
Hello pmw57,
It's a bummer, but I cannot control where these items are being posted to. It could be Twitter, Facebook or another blog that is not using UTF-8.
Cheers
Marc
Then give utf8_decode a try, which converts from UTF-8 to ISO-8859-1.
Hello, yes, I tried that and it did not work on that string:
print utf8_decode('Brand X \x96 people cannot be wrong!');
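One thing worth checking with a test like that: in a single-quoted PHP string the \x96 is not a dash at all, it is four ordinary ASCII characters, so a decoder has nothing to convert. In a double-quoted string it becomes the single byte 0x96, which is an en dash in Windows-1252 but is not valid UTF-8 on its own. The snippet below is only a small sketch of that difference; the variable names and the $isUtf8 helper are mine, not from SimplePie or the feed.

```php
<?php
// Single quotes: PHP keeps the backslash, so this holds the 4 characters \ x 9 6.
$literal = 'Brand X \x96 people cannot be wrong!';

// Double quotes: PHP turns the escape into the single byte 0x96.
$byte = "Brand X \x96 people cannot be wrong!";

echo strlen($literal) - strlen($byte), "\n"; // 3

// Hypothetical helper: preg_match() with the /u modifier returns false
// when the subject string is not valid UTF-8.
$isUtf8 = function (string $s): bool {
    return preg_match('//u', $s) === 1;
};

var_dump($isUtf8($literal)); // bool(true): plain ASCII is valid UTF-8
var_dump($isUtf8($byte));    // bool(false): a lone 0x96 is a continuation byte
```

If the check fails on the real feed text, the data is probably not UTF-8 in the first place, and a UTF-8 decoder is the wrong tool for it.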
Where is the feed for this blog, so that we can work together with the same information to resolve this issue?
Here ya go: http://businessblogs.co.nz/feed
And thank you for the help!
Cheers
Marc
After some testing, I find that the utf8_decode manual page has a user-submitted function called utf2html.
Here is how to use it.
// $rss is a UTF-8 string
utf2html($rss);
// multi-byte characters in $rss are now numeric HTML entities, so it is okay to output as ISO-8859-1
This is the test code. The file is saved as UTF-8 with no BOM, and the text for $rss is copied directly from the RSS page as UTF-8 text.
<?php
header ('Content-Type: text/html;charset=iso-8859-1');
$rss = 'Brand AC/DC – 35,000 people cannot be wrong!';
function utf2html (&$str) {
    $ret = "";
    $max = strlen($str);
    $last = 0; // keeps the index of the last regular character
    for ($i=0; $i<$max; $i++) {
        $c = $str[$i]; // square brackets: the $str{$i} form no longer works in PHP 8
        $c1 = ord($c);
        if ($c1>>5 == 6) { // 110x xxxx, 110 prefix for 2-byte sequences
            $ret .= substr($str, $last, $i-$last); // append all the regular characters we've passed
            $c1 &= 31; // remove the 3-bit two-byte prefix
            $c2 = ord($str[++$i]); // the next byte
            $c2 &= 63; // remove the 2-bit trailing byte prefix
            $c2 |= (($c1 & 3) << 6); // last 2 bits of c1 become first 2 of c2
            $c1 >>= 2; // c1 shifts 2 to the right
            $ret .= "&#" . ($c1 * 0x100 + $c2) . ";"; // append the code point as a numeric entity
            $last = $i+1;
        }
        elseif ($c1>>4 == 14) { // 1110 xxxx, 1110 prefix for 3-byte sequences
            $ret .= substr($str, $last, $i-$last); // append all the regular characters we've passed
            $c2 = ord($str[++$i]); // the next byte
            $c3 = ord($str[++$i]); // the third byte
            $c1 &= 15; // remove the 4-bit three-byte prefix
            $c2 &= 63; // remove the 2-bit trailing byte prefix
            $c3 &= 63; // remove the 2-bit trailing byte prefix
            $c3 |= (($c2 & 3) << 6); // last 2 bits of c2 become first 2 of c3
            $c2 >>= 2; // c2 shifts 2 to the right
            $c2 |= (($c1 & 15) << 4); // last 4 bits of c1 become first 4 of c2
            $c1 >>= 4; // c1 shifts 4 to the right
            $ret .= '&#' . (($c1 * 0x10000) + ($c2 * 0x100) + $c3) . ';'; // append the code point as a numeric entity
            $last = $i+1;
        }
        // anything else (ASCII, 4-byte sequences) is copied through unchanged
    }
    $str = $ret . substr($str, $last); // append the last batch of regular characters
}
$utf8html = $rss;
utf2html($rss);
$iso8859_1html = $rss;
echo 'Page is encoded as ISO-8859-1<br>';
echo 'UTF-8 text is ' . $utf8html . '<br>';
echo 'ISO-8859-1 text is ' . $iso8859_1html . '<br>';
?>
The output is:
Page is encoded as ISO-8859-1
Brand AC/DC – 35,000 people cannot be wrong!
Brand AC/DC – 35,000 people cannot be wrong!
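To see what the three-byte branch of that function does to the dash, the arithmetic can be checked by hand. This is just a sketch of the same bit manipulation written the short way; the variable names are mine, and the only assumption is that the dash in the feed is the en dash U+2013, whose UTF-8 bytes are E2 80 93.

```php
<?php
// UTF-8 bytes of the en dash (U+2013), written as hex escapes.
$bytes = "\xE2\x80\x93";

$b1 = ord($bytes[0]) & 0x0F; // strip the 1110 prefix, keep 4 bits
$b2 = ord($bytes[1]) & 0x3F; // strip the 10 prefix, keep 6 bits
$b3 = ord($bytes[2]) & 0x3F; // strip the 10 prefix, keep 6 bits

// Reassemble the 16-bit code point from the 4 + 6 + 6 payload bits.
$codePoint = ($b1 << 12) | ($b2 << 6) | $b3;

echo '&#' . $codePoint . ';', "\n"; // &#8211;
```

That entity is pure ASCII, which is why it survives on a page served as ISO-8859-1.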
Thanks for all the hard work! I will have a look at this now.
When you think about it, would it not be a simple regex call to remove the \x<whatever>, or am I missing something about the encodings? …
Removing them destroys the information that was there.
The above utf2html converts them where possible, so that the dashes, quotes and other characters actually remain visible.
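If the text really does contain literal backslash-x escapes rather than UTF-8 bytes, the same "convert, don't remove" idea can be sketched with a regex callback. Everything here is an assumption on my part: the function name escapes_to_entities is hypothetical, and the map assumes the escaped bytes are Windows-1252 punctuation (where 0x96 is the en dash that Word likes to insert). Only a handful of common characters are mapped.

```php
<?php
// Hypothetical helper: turns literal "\xNN" escapes into numeric HTML entities.
// Assumption: the escaped bytes are Windows-1252 punctuation.
function escapes_to_entities(string $text): string
{
    // Windows-1252 byte => Unicode code point (partial map, common punctuation only)
    $map = [
        0x85 => 0x2026, // horizontal ellipsis
        0x91 => 0x2018, // left single quote
        0x92 => 0x2019, // right single quote
        0x93 => 0x201C, // left double quote
        0x94 => 0x201D, // right double quote
        0x96 => 0x2013, // en dash
        0x97 => 0x2014, // em dash
    ];

    $result = preg_replace_callback(
        '/\\\\x([0-9A-Fa-f]{2})/', // a literal backslash, "x", then two hex digits
        function (array $m) use ($map): string {
            $byte = hexdec($m[1]);
            // Leave unknown escapes untouched rather than deleting them.
            return isset($map[$byte]) ? '&#' . $map[$byte] . ';' : $m[0];
        },
        $text
    );

    // preg_replace_callback() returns null on error; fall back to the input.
    return $result ?? $text;
}

echo escapes_to_entities('Brand X \x96 people cannot be wrong!'), "\n";
// Brand X &#8211; people cannot be wrong!
```

Escapes that are not in the map are left as they are, so nothing is silently thrown away.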