  1. #1
    SitePoint Enthusiast
    Join Date
    May 2006
    Posts
    48
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    How to remove numeric escaped strings

    Hello all,

    I am using SimplePie to read the RSS feed of a blog that uses UTF-8 (I assume).

    When I fetch the blog feed, I get back this type of stuff:

    Brand X \x96 people cannot be wrong!

    The \x96 appears to be some kind of numeric escape (an NCR?). On the blog it's a long dash, like the kind you get in Word documents.

    I have set the following in SimplePie, but I still cannot remove those \x sequences!

    $this->simplepie->set_output_encoding('ISO-8859-1');
    $this->simplepie->strip_htmltags(array('base', 'blink', 'body', 'doctype', 'embed', 'font', 'form', 'frame', 'frameset', 'html', 'iframe', 'input', 'marquee', 'meta', 'noscript', 'object', 'param', 'script', 'style'));

    Is there a magical regex that I can use in PHP?

    I tried this one, but it did not work:

    print preg_replace("#(\\\x[0-9A-F]{2})#e", "", $string);
    Cheers

    Marc

  2. #2
    Unobtrusively zen silver trophybronze trophy
    paul_wilkins's Avatar
    Join Date
    Jan 2007
    Location
    Christchurch, New Zealand
    Posts
    14,696
    Mentioned
    101 Post(s)
    Tagged
    4 Thread(s)
    Do not strip or replace them. Instead, ensure that you are able to process and display text as UTF-8, and you will have no further trouble.
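A minimal sketch of what "process and display as UTF-8" could look like on the output side, assuming the feed text really is UTF-8 and the mbstring extension is available. The helper name `render_utf8` is made up for illustration. With SimplePie, this would go together with leaving the output encoding at UTF-8 instead of setting it to ISO-8859-1.

```php
<?php
// Hypothetical helper: escape feed text for an HTML page served as UTF-8.
function render_utf8(string $text): string
{
    // Reject input that is not valid UTF-8 rather than guessing at a repair.
    if (!mb_check_encoding($text, 'UTF-8')) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    // Escape only the HTML-special characters; the en dash passes through untouched.
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

// The page itself must also declare UTF-8, e.g. before any output:
// header('Content-Type: text/html; charset=utf-8');

$title = "Brand X \u{2013} people cannot be wrong!";
echo render_utf8($title), "\n";
```

The point of the design is that nothing is converted or stripped: the bytes from the feed reach the browser unchanged, and only the declared charset has to match.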
    Programming Group Advisor
    Reference: JavaScript, Quirksmode Validate: HTML Validation, JSLint
    Car is to Carpet as Java is to JavaScript

  3. #3
    SitePoint Enthusiast
    Hello pmw57,

    It's a bummer, but I cannot control where these items get posted. It could be Twitter, Facebook, or another blog that does not use UTF-8.

    Cheers

    Marc

  4. #4
    Unobtrusively zen silver trophybronze trophy
    paul_wilkins's Avatar
    Then give utf8_decode a try, which decodes from UTF-8 to ISO-8859-1.
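One caveat worth illustrating: ISO-8859-1 has no en dash, so a strict UTF-8 to ISO-8859-1 conversion cannot keep it. The sketch below uses `mb_convert_encoding` rather than `utf8_decode` (which newer PHP versions deprecate) and assumes the mbstring extension with its default substitution character. Windows-1252, which is often used in place of ISO-8859-1, does have an en dash at byte 0x96, and that may be where the \x96 comes from.

```php
<?php
// En dash (U+2013) encoded as UTF-8: bytes E2 80 93.
$utf8 = "Brand X \u{2013} people";

// ISO-8859-1 cannot represent U+2013, so mbstring substitutes its
// default replacement character ("?").
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

// Windows-1252 maps U+2013 to the single byte 0x96.
$cp1252 = mb_convert_encoding($utf8, 'Windows-1252', 'UTF-8');

echo bin2hex($latin1), "\n";
echo bin2hex($cp1252), "\n";
```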

  5. #5
    SitePoint Enthusiast
    Hello. Yes, I tried that, and it did not work on this string:

    print utf8_decode('Brand X \x96 people cannot be wrong!');
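A possible reason this test appears to do nothing: in a single-quoted PHP string, \x96 is four literal characters (backslash, x, 9, 6), not a byte, so there is nothing for utf8_decode to convert. In a double-quoted string it is the single byte 0x96, which is an en dash in Windows-1252 but is not valid UTF-8 on its own. A small sketch, assuming the mbstring extension and assuming the text really arrives as Windows-1252:

```php
<?php
$literal = 'Brand X \x96 people';  // single quotes: backslash, x, 9, 6
$byte    = "Brand X \x96 people";  // double quotes: one raw byte 0x96

echo strlen($literal), "\n"; // 19
echo strlen($byte), "\n";    // 16

// The raw byte alone is not valid UTF-8 ...
var_dump(mb_check_encoding($byte, 'UTF-8')); // bool(false)

// ... but read as Windows-1252 it converts cleanly to a UTF-8 en dash.
$utf8 = mb_convert_encoding($byte, 'UTF-8', 'Windows-1252');
echo bin2hex(substr($utf8, 8, 3)), "\n"; // e28093
```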

  6. #6
    Unobtrusively zen silver trophybronze trophy
    paul_wilkins's Avatar
    Where is the feed for this blog, so that we can work together with the same information to resolve this issue?

  7. #7
    SitePoint Enthusiast
    Here ya go: http://businessblogs.co.nz/feed

    And thank you for the help!

    Cheers

    Marc

  8. #8
    Unobtrusively zen silver trophybronze trophy
    paul_wilkins's Avatar
    After some testing, I find that the same utf8_decode manual page has a user-submitted function called utf2html.

    Here is how to use it.

    Code php:
    // $rss is a UTF-8 string
    utf2html($rss);
    // $rss is now safe to output as ISO-8859-1

    This is the test code. The file is saved as UTF-8 with no BOM, and the text for $rss is copied directly from the RSS page as UTF-8 text.

    Code php:
    <?php
    header('Content-Type: text/html;charset=iso-8859-1');
    $rss = 'Brand AC/DC – 35,000 people cannot be wrong!';
     
    // Replaces 2- and 3-byte UTF-8 sequences in $str (modified in place)
    // with decimal numeric character references such as &#8211;
    function utf2html (&$str) {
     
        $ret = "";
        $max = strlen($str);
        $last = 0;  // index just past the last multi-byte sequence handled
        for ($i=0; $i<$max; $i++) {
            $c = $str[$i];  // square brackets: the {} offset syntax is gone in PHP 8
            $c1 = ord($c);
            if ($c1>>5 == 6) {  // 110x xxxx, 110 prefix for 2 bytes unicode
                $ret .= substr($str, $last, $i-$last); // append all the regular characters we've passed
                $c1 &= 31; // remove the 3 bit two bytes prefix
                $c2 = ord($str[++$i]); // the next byte
                $c2 &= 63;  // remove the 2 bit trailing byte prefix
                $c2 |= (($c1 & 3) << 6); // last 2 bits of c1 become first 2 of c2
                $c1 >>= 2; // c1 shifts 2 to the right
                $ret .= "&#" . ($c1 * 0x100 + $c2) . ";"; // emit the code point as a decimal reference
                $last = $i+1;
            }
            elseif ($c1>>4 == 14) {  // 1110 xxxx, 1110 prefix for 3 bytes unicode
                $ret .= substr($str, $last, $i-$last); // append all the regular characters we've passed
                $c2 = ord($str[++$i]); // the next byte
                $c3 = ord($str[++$i]); // the third byte
                $c1 &= 15; // remove the 4 bit three bytes prefix
                $c2 &= 63;  // remove the 2 bit trailing byte prefix
                $c3 &= 63;  // remove the 2 bit trailing byte prefix
                $c3 |= (($c2 & 3) << 6); // last 2 bits of c2 become first 2 of c3
                $c2 >>=2; //c2 shifts 2 to the right
                $c2 |= (($c1 & 15) << 4); // last 4 bits of c1 become first 4 of c2
                $c1 >>= 4; // c1 shifts 4 to the right
                $ret .= '&#' . (($c1 * 0x10000) + ($c2 * 0x100) + $c3) . ';'; // emit the code point as a decimal reference
                $last = $i+1;
            }
        }
        $str = $ret . substr($str, $last); // append the last batch of regular characters
    }
     
    $utf8html = $rss;
     
    utf2html($rss);
    $iso8859_1html = $rss;
     
    echo 'Page is encoded as ISO-8859-1<br>';
    echo 'UTF-8 text is ' . $utf8html . '<br>';
    echo 'ISO-8859-1 text is ' . $iso8859_1html . '<br>';
    ?>

    The output is:
    Code:
    Page is encoded as ISO-8859-1
    Brand AC/DC â€“ 35,000 people cannot be wrong!
    Brand AC/DC – 35,000 people cannot be wrong!
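For comparison, the mbstring extension has a built-in function that does roughly the same job as the hand-written loop: `mb_encode_numericentity`. This is a sketch of an alternative, not the function from the manual page. The name `utf8_to_ncr` and the conversion map (everything from U+0080 upward) are my own choices.

```php
<?php
// Convert every non-ASCII character in a UTF-8 string to a decimal
// numeric character reference, leaving plain ASCII untouched.
function utf8_to_ncr(string $utf8): string   // hypothetical name
{
    // Map format: [first code point, last code point, offset, mask].
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($utf8, $map, 'UTF-8');
}

$rss = "Brand AC/DC \u{2013} 35,000 people cannot be wrong!";
echo utf8_to_ncr($rss), "\n";
// Brand AC/DC &#8211; 35,000 people cannot be wrong!
```

Unlike the loop above, this also covers 4-byte sequences, since the map runs up to U+10FFFF.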

  9. #9
    SitePoint Enthusiast
    Thanks for all the hard work! I will have a look at this now.

  10. #10
    SitePoint Enthusiast
    When you think about it, would it not be a simple regex call to remove the \x<whatever>, or am I missing something about the encodings?

  11. #11
    Unobtrusively zen silver trophybronze trophy
    paul_wilkins's Avatar
    Removing them destroys the information that was there.

    The utf2html function above converts them where possible, so that the dashes, quotes and other characters actually remain visible.
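To make the difference concrete, here is a sketch for destinations that can only take plain ASCII. It assumes the text holds Windows-1252 bytes (as the \x96 suggests). The replacement table is illustrative and deliberately incomplete.

```php
<?php
$text = "Brand X \x96 people cannot be wrong!";

// Option 1: strip every byte in the 0x80-0x9F range. The dash is simply lost.
$stripped = preg_replace('/[\x80-\x9F]/', '', $text);

// Option 2: map common Windows-1252 punctuation bytes to ASCII stand-ins.
$ascii = strtr($text, [
    "\x85" => '...',  // horizontal ellipsis
    "\x91" => "'",    // left single quotation mark
    "\x92" => "'",    // right single quotation mark
    "\x93" => '"',    // left double quotation mark
    "\x94" => '"',    // right double quotation mark
    "\x96" => '-',    // en dash
    "\x97" => '--',   // em dash
]);

echo $stripped, "\n"; // Brand X  people cannot be wrong!
echo $ascii, "\n";    // Brand X - people cannot be wrong!
```

Stripping leaves a double space where the dash used to be, while the mapping keeps the sentence readable.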
