  1. #1
    SitePoint Evangelist -Oz-'s Avatar
    Join Date
    Nov 2001
    Location
    Phoenix, AZ, USA
    Posts
    406

    Remove certain characters or change character set? (XML)

    I'm using PHP to generate XML feeds for content on my gaming website. I thought I had used enough str_replace calls to get rid of all the problems, but apparently not. This is what I currently do:
    PHP Code:
    $get['content'] = strip_tags($get['content']);
    // The search strings below are empty in this copy of the post. The byte escapes are an
    // assumption: Windows-1252 curly quotes and ellipsis, inferred from the replacements.
    $get['content'] = ereg_replace("", "'", $get['content']); // search pattern not recoverable
    // Remove MS Word formatting
    $get['content'] = str_replace("\x91", "'", $get['content']);
    $get['content'] = str_replace("\x92", "'", $get['content']);
    $get['content'] = str_replace("\x93", '"', $get['content']);
    $get['content'] = str_replace("\x94", '"', $get['content']);
    $get['content'] = str_replace("\x85", "...", $get['content']);
    Now I have this new problem: http://www.feedvalidator.org/check?u...l/news_top.xml The money sign in it (€, the Euro) doesn't show properly in the feed.

    Explanation: This error is commonly seen when an encoding like iso-8859-1 is declared when what actually is desired is windows-1252. It also occurs when numeric character references are computed based on windows-1252 code points values as opposed to the character's code point in ISO/IEC 10646.

    Solution: For maximum portability, convert the characters to either a utf or iso encoding. If that is not practical, try to match the declaration to reflect the actual encoding used. If you choose to use numeric character references, make sure that you use the Unicode code point value rather than the code point in the native character set. Users on Windows platforms may find the cp1252 to Unicode table helpful; of special interest is the mapping of characters in the 0x80 through 0x9F range.
    Is there an easy way I could fix this issue in php so I wouldn't have to think of every possible character that could go wrong?
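One way to follow the validator's advice, sketched under assumptions: if the stored text really is Windows-1252 (the thread never confirms the database encoding), it can be converted to UTF-8 in one call and then escaped for XML, with the feed declared as UTF-8 in its XML declaration. The function name `feed_text` is hypothetical.

```php
<?php
// Hypothetical helper: converts Windows-1252 text to UTF-8 and escapes it for XML.
// Assumes the input bytes are Windows-1252; that is a guess about the stored data.
function feed_text(string $cp1252): string
{
    // iconv() returns false if it meets a byte it cannot convert.
    $utf8 = iconv('CP1252', 'UTF-8', $cp1252);
    if ($utf8 === false) {
        throw new RuntimeException('Input is not valid Windows-1252');
    }
    // Escape &, <, > and quotes so the text is safe inside an XML element.
    return htmlspecialchars($utf8, ENT_QUOTES | ENT_XML1, 'UTF-8');
}

// "\x80" is the Windows-1252 byte for the euro sign.
echo feed_text("Price: \x80249 & up"), "\n";
```

With this approach no per-character list is needed: every defined Windows-1252 character ends up as its proper UTF-8 sequence.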
    Oz
    GamersMark - On Target Gaming
    OzTheory - Programming and Web Solutions
    AmIBlocked - Check if you've been blocked on IM

  2. #2
    $this->toCD-R(LP); vinyl-junkie's Avatar
    Join Date
    Dec 2003
    Location
    Federal Way, Washington (USA)
    Posts
    1,524
    Quote Originally Posted by -Oz-
    Is there an easy way I could fix this issue in php so I wouldn't have to think of every possible character that could go wrong?
    There really shouldn't be that many characters that you'd have to apply a str_replace to.

    Could you show us the whole $get['content'] value prior to applying the str_replace? That might help in advising you on what to do.
    Music Around The World - Collecting tips, trade
    and want lists, album reviews, & more
    Showcase your music collection on the Web

  3. #3
    SitePoint Evangelist -Oz-'s Avatar
    The content after the shortening and strip_tags is:
    Quote Originally Posted by $get['content']
    Sony announced earlier today that the PSP (PlayStation Portable) will launch in Europe on September 1st, price at €249 or £179 (this is roughly $325). A Value Pack that made its debut at the PSP'

  4. #4
    $this->toCD-R(LP); vinyl-junkie's Avatar
    I'm not aware of any comprehensive list of characters that will trip you up as far as XML validation goes. Perhaps I haven't used a lot of funky characters such as your Euro symbol, but when I encounter such a problem I generally research it, find the equivalent numeric character reference, and do a str_replace with that, as you have done with those other special characters.

    For what it's worth, the numeric character reference for the Euro sign is "&#8364;" (without the quotes); 8364 is the decimal form of the Unicode code point U+20AC.

    Don't know if I offered all that much help but I tried.

  5. #5
    SitePoint Evangelist -Oz-'s Avatar
    That will help. Does anyone know of a script or loop that will go through an array of characters like that and replace each one?

  6. #6
    $this->toCD-R(LP); vinyl-junkie's Avatar
    Do you need to loop through an array or do you just need to replace all of a certain character with something else? If it's the latter, str_replace will replace all occurrences. If the former, I'd implode the array, do my str_replace, then explode the array back into its original array state.
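For what it's worth, str_replace also accepts an array as its third argument and returns an array with the replacement applied to every element, so the implode/explode round trip should not be needed. A small sketch with made-up sample data:

```php
<?php
// Sample data is invented for illustration; "\x92" is the Windows-1252 right single quote.
$items = array("can\x92t", "won\x92t", "plain");

// With an array subject, str_replace() processes each element and returns an array.
$clean = str_replace("\x92", "'", $items);

print_r($clean);
```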

  7. #7
    $this->toCD-R(LP); vinyl-junkie's Avatar
    Just a follow-up on this thread. Leave it to SitePoint to come up with the perfectly timed newsletter! SitePoint Tech Times #112 has an article titled Character Encodings and Input which should help you a bit with regard to your character problem. The articles it links to are rather long but look like they'll be worth spending some time reading.

    Hope this helps.

  8. #8
    SitePoint Evangelist -Oz-'s Avatar
    Yeah, got that newsletter in my inbox and thought it was quite ironic.

  9. #9
    SitePoint Evangelist dmsuperman's Avatar
    Join Date
    Feb 2005
    Location
    A box
    Posts
    516
    str_replace on its own finds every instance of its first argument and replaces it with the second, so $content = str_replace("", "What to replace with", $get["content"]); would find every occurrence of the character between the first pair of quotes and replace it with "What to replace with" (without the quotes).
    <(^.^<) \(^.^\) (^.^) (/^.^)/ (>^.^)>
    Core 2 Duo E8400 clocked @ 3.375GHz, 2x2GB 800MHz DDR2 RAM
    5x SATA drives totalling 2.5TB, 7900GS KO, 6600GT

  10. #10
    SitePoint Evangelist -Oz-'s Avatar
    Quote Originally Posted by dmsuperman
    str_replace on its own finds every instance of its first argument and replaces it with the second, so $content = str_replace("", "What to replace with", $get["content"]); would find every occurrence of the character between the first pair of quotes and replace it with "What to replace with" (without the quotes).
    That is what I currently do, but there are a lot of characters, so I was looking for an easier way.

  11. #11
    SitePoint Wizard Young Twig's Avatar
    Join Date
    Dec 2003
    Location
    Albany, New York
    Posts
    1,355
    Use an array:

    PHP Code:
    $chars=array(
        'badchar' => 'replacement',
        'badchar' => 'replacement',
        'badchar' => 'replacement',
        'badchar' => 'replacement',
        'badchar' => 'replacement',
        'badchar' => 'replacement',
        'badchar' => 'replacement'
    );

    $content=str_replace(array_keys($chars),array_values($chars),$content);
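A filled-in version of the same idea, as a sketch. The byte values assume Windows-1252 input and the replacements are only examples. strtr() takes the key => replacement array directly, and unlike str_replace() with arrays it does not rescan text it has already replaced, so one replacement cannot be altered by a later pair.

```php
<?php
// Example map only; keys are Windows-1252 bytes (an assumption about the input encoding).
$chars = array(
    "\x80" => '&#8364;', // euro sign
    "\x91" => "'",       // left single quote
    "\x92" => "'",       // right single quote
    "\x93" => '"',       // left double quote
    "\x94" => '"',       // right double quote
    "\x85" => '...',     // ellipsis
);

// Invented sample text containing several of the mapped bytes.
$content = "\x93It\x92s \x80249\x85\x94";
$content = strtr($content, $chars);

echo $content, "\n";
```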

  12. #12
    SitePoint Evangelist -Oz-'s Avatar
    I may have a fix. I'm trying this out right now:
    PHP Code:
    function gb2unicode($gb)
    {
        if (!trim($gb)) {
            return $gb;
        }
        $filename = "http://www.yourdomain.com/cp1252.txt";
        $tmp = file($filename);
        // Build a lookup table from the mapping file: byte value => hex code point
        $codetable = array();
        foreach ($tmp as $value) {
            $codetable[hexdec(substr($value, 0, 6))] = substr($value, 9, 4);
        }
        $utf = "";
        while ($gb) {
            if (ord(substr($gb, 0, 1)) > 127) {
                // Takes two bytes for every byte above 127
                $pair = substr($gb, 0, 2);
                $gb = substr($gb, 2, strlen($gb));
                $utf .= "&#x" . $codetable[hexdec(bin2hex($pair)) - 0x8080] . ";";
            } else {
                $utf .= substr($gb, 0, 1);
                $gb = substr($gb, 1, strlen($gb));
            }
        }
        return $utf;
    }
    I got the text file from: http://www.unicode.org/Public/MAPPIN...OWS/CP1252.TXT
    So far I think it works, though I have no real way to test it yet. The feed validates, at least.

    OKAY, THIS DIDN'T WORK. Words like can't came out as "can " (with a space afterwards). Back to research.
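One likely reason for the missing characters: the function above reads two bytes for every byte above 127, which suits a double-byte encoding such as GB2312 (hence the name) but not Windows-1252, where every character is a single byte, so the character after each curly quote gets swallowed. A single-byte sketch of the same idea follows. The function name `cp1252_to_ncr` is hypothetical, the table is typed in from the 0x80-0x9F rows of the cp1252 mapping rather than read from a file, and it assumes the input really is Windows-1252.

```php
<?php
// Hypothetical helper: replaces every non-ASCII Windows-1252 byte with a
// hexadecimal numeric character reference based on its Unicode code point.
function cp1252_to_ncr(string $text): string
{
    // Windows-1252 bytes 0x80-0x9F and their Unicode code points.
    // Bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D are undefined and are dropped below.
    $high = array(
        0x80 => 0x20AC, 0x82 => 0x201A, 0x83 => 0x0192, 0x84 => 0x201E,
        0x85 => 0x2026, 0x86 => 0x2020, 0x87 => 0x2021, 0x88 => 0x02C6,
        0x89 => 0x2030, 0x8A => 0x0160, 0x8B => 0x2039, 0x8C => 0x0152,
        0x8E => 0x017D, 0x91 => 0x2018, 0x92 => 0x2019, 0x93 => 0x201C,
        0x94 => 0x201D, 0x95 => 0x2022, 0x96 => 0x2013, 0x97 => 0x2014,
        0x98 => 0x02DC, 0x99 => 0x2122, 0x9A => 0x0161, 0x9B => 0x203A,
        0x9C => 0x0153, 0x9E => 0x017E, 0x9F => 0x0178,
    );

    // No /u modifier, so the pattern matches one byte at a time.
    return preg_replace_callback('/[\x80-\xFF]/', function (array $m) use ($high) {
        $byte = ord($m[0]);
        if ($byte >= 0xA0) {
            $codepoint = $byte;            // 0xA0-0xFF equal their Unicode code points
        } elseif (isset($high[$byte])) {
            $codepoint = $high[$byte];     // 0x80-0x9F need the lookup table
        } else {
            return '';                     // undefined byte: drop it
        }
        return sprintf('&#x%X;', $codepoint);
    }, $text);
}

echo cp1252_to_ncr("can\x92t wait: \x80249"), "\n";
```

ASCII text passes through untouched, so only the problem bytes are rewritten.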

  13. #13
    SitePoint Evangelist -Oz-'s Avatar
    I'd had enough of looking around, so I wrote a function that covers every character I could come up with and replaces it with its &blah; code:
    PHP Code:
    function utf8encode($text=""){
        //Compiled by OzTheory.com
        // The keys are empty in this copy of the post. They are written here as
        // Windows-1252 byte escapes, an assumption inferred from each replacement.
        $chars=array(
            "\xD2" => '&Ograve;',
            "\xD3" => '&Oacute;',
            "\xD4" => '&Ocirc;',
            "\xD5" => '&Otilde;',
            "\xD8" => '&Oslash;',
            "\xD9" => '&Ugrave;',
            "\xDA" => '&Uacute;',
            "\xDB" => '&Ucirc;',
            "\xDC" => '&Uuml;',
            "\xDF" => '&szlig;',
            "\xE0" => '&agrave;',
            "\xE1" => '&aacute;',
            "\xE2" => '&acirc;',
            "\xE3" => '&atilde;',
            "\xE4" => '&auml;',
            "\xE5" => '&aring;',
            "\xE6" => '&aelig;',
            "\xE7" => '&ccedil;',
            "\xE8" => '&egrave;',
            "\xE9" => '&eacute;',
            "\xEA" => '&ecirc;',
            "\xEB" => '&euml;',
            "\xEC" => '&igrave;',
            "\xED" => '&iacute;',
            "\xEE" => '&icirc;',
            "\xEF" => '&iuml;',
            "\xF1" => '&ntilde;',
            "\xF2" => '&ograve;',
            "\xF3" => '&oacute;',
            "\xF4" => '&ocirc;',
            "\xF5" => '&otilde;',
            "\xF6" => '&ouml;',
            "\xF7" => '&divide;',
            "\xF8" => '&oslash;',
            "\xF9" => '&ugrave;',
            "\xFA" => '&uacute;',
            "\xFB" => '&ucirc;',
            "\xFC" => '&uuml;',
            "\xFF" => '&yuml;',
            // 15 entries followed here whose keys and replacements are both empty
            // in this copy of the post; they are omitted.
            "\xD1" => '&Ntilde;',
            "\xCF" => '&Iuml;',
            "\xCE" => '&Icirc;',
            "\xCD" => '&Iacute;',
            "\xCC" => '&Igrave;',
            "\xCB" => '&Euml;',
            "\xCA" => '&Ecirc;',
            "\xC9" => '&Eacute;',
            "\xC8" => '&Egrave;',
            "\xC7" => '&Ccedil;',
            "\xC6" => '&AElig;',
            "\xC5" => '&Aring;',
            "\xC4" => '&Auml;',
            "\xC3" => '&Atilde;',
            "\xC2" => '&Acirc;',
            "\xC1" => '&Aacute;',
            "\xC0" => '&Agrave;',
            "\xBF" => '&iquest;',
            "\xB5" => '&micro;',
            "\xB1" => '&plusmn;',
            "\xB0" => '&deg;',
            "\xAE" => '&reg;',
            "\xA9" => '&copy;',
            "\xA8" => '&uml;',
            "\xA7" => '&sect;',
            "\xA5" => '&yen;',
            "\xA3" => '&pound;',
            // One more entry with an empty key and replacement appeared here; omitted.
            "\xA2" => '&cent;',
            "\xA1" => '&iexcl;',
            // Curly quotes and ellipsis (which byte goes with which line is assumed)
            "\x91" => "'",
            "\x92" => "'",
            "\x93" => '"',
            "\x94" => '"',
            "\x85" => '...',
            // The replacement for the plain apostrophe is empty in this copy;
            // as written, this entry deletes apostrophes.
            "'" => ''
        );
        $text=str_replace(array_keys($chars),array_values($chars),$text);
        return $text;
    }
    So far it validates all my xml files and appears to work just fine loading content on the HTML page. Let me know if you use it and find any problems.

  14. #14
    One website at a time mmj's Avatar
    Join Date
    Feb 2001
    Location
    Melbourne Australia
    Posts
    6,282
    To filter out all non-ascii characters:

    PHP Code:
    $text = preg_replace('/[^\x09\x0A\x0D\x20-\x7E]+/', '', $text);
    The above will work with US-ASCII, ISO-8859-1, UTF-8, and any other character encoding that is based on ASCII. It works by filtering out any characters other than 09, 0A, 0D, and 20-7E (all hex values).

    If you're using ISO-8859-1 and want to allow extended characters you can use this instead.

    PHP Code:
    $text = preg_replace('/[^\x09\x0A\x0D\x20-\x7E\xC0-\xFF]+/', '', $text);
    Let me know how this works out.
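A quick sketch of what the two patterns do to a sample string. The sample is invented and assumed to be Windows-1252 bytes: \xE9 is é and \x80 is the Windows-1252 euro sign.

```php
<?php
// Invented sample: "café €249" as Windows-1252 bytes.
$text = "caf\xE9 \x80249";

// ASCII only: both high bytes are removed.
$ascii = preg_replace('/[^\x09\x0A\x0D\x20-\x7E]+/', '', $text);

// ASCII plus 0xC0-0xFF: the accented letter survives, the 0x80 byte does not.
$latin = preg_replace('/[^\x09\x0A\x0D\x20-\x7E\xC0-\xFF]+/', '', $text);

var_dump($ascii === "caf 249");
var_dump($latin === "caf\xE9 249");
```

Note that the second pattern still removes bytes 0xA0-0xBF, which in ISO-8859-1 hold symbols such as £ (0xA3) and © (0xA9), so the range may need widening if those should be kept.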
    [mmj] My magic jigsaw
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    The Bit Depth Blog Twitter Contact me
    Neon Javascript Framework Jokes Android stuff

  15. #15
    SitePoint Evangelist -Oz-'s Avatar
    Won't both of those just get rid of the character completely rather than replace it?

  16. #16
    One website at a time mmj's Avatar
    Yes, that's right. So you can use that as a last step, after you have tried to convert the characters to their equivalents in the correct character encoding.

