-
Apr 27, 2005, 11:52 #1
- Join Date
- Nov 2001
- Location
- Phoenix, AZ, USA
- Posts
- 406
Remove certain characters or change character set? (XML)
I'm using PHP to generate XML feeds for content on my gaming website. I thought I had used enough str_replace calls to get rid of all the problems, but apparently not. This is what I currently do:
PHP Code:
$get['content'] = strip_tags($get['content']);
$get['content'] = ereg_replace("’","'",$get['content']);
// Remove MS word formatting
$get['content'] = str_replace("’", "'", $get['content']);
$get['content'] = str_replace("‘", "'", $get['content']);
$get['content'] = str_replace('“', '"', $get['content']);
$get['content'] = str_replace('”', '"', $get['content']);
$get['content'] = str_replace("…", "...", $get['content']);
Explanation: This error is commonly seen when an encoding like iso-8859-1 is declared when what is actually desired is windows-1252. It also occurs when numeric character references are computed from windows-1252 code point values as opposed to the character's code point in ISO/IEC 10646.
Solution: For maximum portability, convert the characters to either a utf or iso encoding. If that is not practical, try to match the declaration to the actual encoding used. If you choose to use numeric character references, make sure that you use the Unicode code point value rather than the code point in the native character set. Users on the Windows platform may find the cp1252 to Unicode table helpful; of special interest is the mapping of characters in the 0x80 through 0x9F range.
Oz
GamersMark - On Target Gaming
OzTheory - Programming and Web Solutions
AmIBlocked - Check if you've been blocked on IM
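A minimal sketch of the "convert the characters to a utf encoding" route from the validator's solution, assuming the stored content really is windows-1252 bytes and that PHP's iconv extension is available. The function name cp1252_to_utf8 is made up for this example, and the feed would also need to declare encoding="UTF-8" for the result to validate.

```php
<?php
// Hypothetical helper (not from the thread): re-encode windows-1252 bytes as UTF-8.
function cp1252_to_utf8(string $raw): string
{
    // iconv() maps the 0x80-0x9F bytes (smart quotes, ellipsis, euro sign)
    // to their real Unicode code points before encoding them as UTF-8.
    $utf8 = iconv('CP1252', 'UTF-8', $raw);

    // iconv() returns false when it cannot convert the input, which may
    // happen for bytes that cp1252 leaves undefined (0x81, for example).
    // Fall back to the untouched input in that case.
    return $utf8 === false ? $raw : $utf8;
}

// "\x92" is the cp1252 byte for the right single quote that Word inserts.
$get = ['content' => "can\x92t"];
$get['content'] = cp1252_to_utf8(strip_tags($get['content']));
```

After this step the apostrophe is stored as the three UTF-8 bytes for U+2019, so no per-character str_replace list is needed for it.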
-
Apr 27, 2005, 19:40 #2
- Join Date
- Dec 2003
- Location
- Federal Way, Washington (USA)
- Posts
- 1,524
Originally Posted by -Oz-
Could you show us the whole $get['content'] value prior to applying the str_replace? That might help in advising you on what to do.
Music Around The World - Collecting tips, trade
and want lists, album reviews, & more
Showcase your music collection on the Web
-
Apr 27, 2005, 20:10 #3
- Join Date
- Nov 2001
- Location
- Phoenix, AZ, USA
- Posts
- 406
the content after the shortening and strip tags is:
Originally Posted by $get['content'
Oz
-
Apr 27, 2005, 21:09 #4
- Join Date
- Dec 2003
- Location
- Federal Way, Washington (USA)
- Posts
- 1,524
I'm not aware of any comprehensive list of the characters that will trip you up as far as XML validation goes. Perhaps I haven't used many unusual characters such as your Euro symbol, but when I run into a problem like this I generally research it, find the equivalent numeric character reference, and do a str_replace with that, as you have done with those other special characters.
For what it's worth, the numeric character reference for the Euro sign is "& #8364;" (as all one string without the quotes).
Don't know if I offered all that much help but I tried.
-
Apr 27, 2005, 22:05 #5
- Join Date
- Nov 2001
- Location
- Phoenix, AZ, USA
- Posts
- 406
That will help. Does anyone know of a script or while loop that will go through an array of characters like that and replace each one?
Oz
-
Apr 28, 2005, 04:17 #6
- Join Date
- Dec 2003
- Location
- Federal Way, Washington (USA)
- Posts
- 1,524
-
Apr 28, 2005, 04:32 #7
- Join Date
- Dec 2003
- Location
- Federal Way, Washington (USA)
- Posts
- 1,524
Just a follow-up on this thread. Leave it to SitePoint to come up with the perfectly timed newsletter!
SitePoint Tech Times #112 has an article titled Character Encodings and Input which should help you a bit with regard to your character problem. The articles it links to are rather long but look like they'll be worth spending some time reading.
Hope this helps.
-
Apr 28, 2005, 14:37 #8
- Join Date
- Nov 2001
- Location
- Phoenix, AZ, USA
- Posts
- 406
Yeah, got that newsletter in my inbox and thought it was quite ironic.
Oz
-
Apr 28, 2005, 14:56 #9
- Join Date
- Feb 2005
- Location
- A box
- Posts
- 516
str_replace on its own will find all instances of the first argument and replace them with the second argument, so $content = str_replace("£", "What to replace with", $get["content"]); would find every £ and replace it with "What to replace with" (without the quotes).
<(^.^<) \(^.^\) (^.^) (/^.^)/ (>^.^)>
Core 2 Duo E8400 clocked @ 3.375GHz, 2x2GB 800MHz DDR2 RAM
5x SATA drives totalling 2.5TB, 7900GS KO, 6600GT
-
Apr 28, 2005, 16:53 #10
- Join Date
- Nov 2001
- Location
- Phoenix, AZ, USA
- Posts
- 406
Originally Posted by dmsuperman
Oz
-
Apr 28, 2005, 17:10 #11
- Join Date
- Dec 2003
- Location
- Albany, New York
- Posts
- 1,355
Use an array:
PHP Code:
$chars=array(
'badchar' => 'replacement',
'badchar' => 'replacement',
'badchar' => 'replacement',
'badchar' => 'replacement',
'badchar' => 'replacement',
'badchar' => 'replacement',
'badchar' => 'replacement'
);
$content=str_replace(array_keys($chars),array_values($chars),$content);
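One thing to watch with the array form: str_replace works through the pairs in order, so a later pair can re-replace the output of an earlier one. A small sketch of the difference, using strtr as a single-pass alternative. The two-entry map and the sample string are made up for illustration, and the "\u{...}" escapes need PHP 7 or later.

```php
<?php
// Hypothetical map for illustration; "\u{2019}" yields the UTF-8 bytes
// of the right single quotation mark.
$chars = [
    "\u{2019}" => "'",     // curly apostrophe -> plain apostrophe
    "'"        => '&#39;', // plain apostrophe -> numeric character reference
];
$content = "can\u{2019}t";

// str_replace() applies the pairs one after another, so the second pair
// also rewrites the apostrophe that the first pair just produced.
$sequential = str_replace(array_keys($chars), array_values($chars), $content);

// strtr() with an array walks the input once and never rescans text it
// has already replaced.
$singlePass = strtr($content, $chars);
```

Here $sequential ends up as can&#39;t while $singlePass is can't, so the order of entries only matters for the str_replace version.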
-
Apr 28, 2005, 18:05 #12
- Join Date
- Nov 2001
- Location
- Phoenix, AZ, USA
- Posts
- 406
I may have a fix. I'm trying this out right now:
PHP Code:
function gb2unicode($gb)
{
if(!trim($gb))
return $gb;
$filename="http://www.yourdomain.com/cp1252.txt";
$tmp=file($filename);
$codetable=array();
while(list($key,$value)=each($tmp))
$codetable[hexdec(substr($value,0,6))]=substr($value,9,4);
$utf="";
while($gb)
{
if (ord(substr($gb,0,1))>127)
{
$this=substr($gb,0,2);
$gb=substr($gb,2,strlen($gb));
$utf.="&#x".$codetable[hexdec(bin2hex($this))-0x8080].";";
}
else
{
$utf.=substr($gb,0,1);
$gb=substr($gb,1,strlen($gb));
}
}
return $utf;
}
So far I think it works, though I have no real way to test it yet. The feed validates, though.
OKAY, THIS DIDN'T WORK. For words like "can't" it produced "can " (with a space afterwards). Back to research.
Oz
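A likely reason for the "can " result: gb2unicode was written for a double-byte encoding, so it consumes two bytes for every byte above 127, and in windows-1252 the second of those bytes is the t after the quote. A minimal single-byte sketch follows, assuming the input is windows-1252. The function name cp1252_to_ncr is hypothetical, and the code points are taken from the standard cp1252 to Unicode mapping.

```php
<?php
// Hypothetical helper: turn windows-1252 bytes into ASCII text with
// decimal numeric character references (one input byte = one character).
function cp1252_to_ncr(string $text): string
{
    // Unicode code points for the bytes 0x80-0x9F that cp1252 defines.
    static $high = [
        0x80 => 0x20AC, 0x82 => 0x201A, 0x83 => 0x0192, 0x84 => 0x201E,
        0x85 => 0x2026, 0x86 => 0x2020, 0x87 => 0x2021, 0x88 => 0x02C6,
        0x89 => 0x2030, 0x8A => 0x0160, 0x8B => 0x2039, 0x8C => 0x0152,
        0x8E => 0x017D, 0x91 => 0x2018, 0x92 => 0x2019, 0x93 => 0x201C,
        0x94 => 0x201D, 0x95 => 0x2022, 0x96 => 0x2013, 0x97 => 0x2014,
        0x98 => 0x02DC, 0x99 => 0x2122, 0x9A => 0x0161, 0x9B => 0x203A,
        0x9C => 0x0153, 0x9E => 0x017E, 0x9F => 0x0178,
    ];

    $out = '';
    $len = strlen($text);
    for ($i = 0; $i < $len; $i++) {
        $byte = ord($text[$i]);
        if ($byte < 0x80) {
            $out .= $text[$i];                 // plain ASCII passes through
        } elseif ($byte >= 0xA0) {
            $out .= '&#' . $byte . ';';        // same code point as ISO-8859-1
        } elseif (isset($high[$byte])) {
            $out .= '&#' . $high[$byte] . ';'; // remapped 0x80-0x9F range
        }
        // Bytes that cp1252 leaves undefined (0x81, 0x8D, ...) are dropped.
    }
    return $out;
}
```

With this version cp1252_to_ncr("can\x92t") gives can&#8217;t. Note that it does not escape &, < or >, so that step would still be needed separately before the text goes into the feed.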
-
Apr 28, 2005, 19:25 #13
- Join Date
- Nov 2001
- Location
- Phoenix, AZ, USA
- Posts
- 406
I had enough of looking around, so I wrote a function that covers every character I could come up with and replaces it with its &blah; code:
PHP Code:
function utf8encode($text=""){
//Compiled by OzTheory.com
$chars=array(
'Ò' => 'Ò',
'Ó' => 'Ó',
'Ô' => 'Ô',
'Õ' => 'Õ',
'Ø' => 'Ø',
'Ù' => 'Ù',
'Ú' => 'Ú',
'Û' => 'Û',
'Ü' => 'Ü',
'ß' => 'ß',
'à' => 'à',
'á' => 'á',
'â' => 'â',
'ã' => 'ã',
'ä' => 'ä',
'å' => 'å',
'æ' => 'æ',
'ç' => 'ç',
'è' => 'è',
'é' => 'é',
'ê' => 'ê',
'ë' => 'ë',
'ì' => 'ì',
'í' => 'í',
'î' => 'î',
'ï' => 'ï',
'ñ' => 'ñ',
'ò' => 'ò',
'ó' => 'ó',
'ô' => 'ô',
'õ' => 'õ',
'ö' => 'ö',
'÷' => '÷',
'ø' => 'ø',
'ù' => 'ù',
'ú' => 'ú',
'û' => 'û',
'ü' => 'ü',
'ÿ' => 'ÿ',
'‚' => '‚',
'ƒ' => 'ƒ',
'„' => '„',
'…' => '…',
'†' => '†',
'‡' => '‡',
'ˆ' => 'ˆ',
'‰' => '‰',
'Œ' => 'Œ',
'–' => '–',
'—' => '—',
'˜' => '˜',
'™' => '™',
'œ' => 'œ',
'Ÿ' => 'Ÿ',
'Ñ' => 'Ñ',
'Ï' => 'Ï',
'Î' => 'Î',
'Í' => 'Í',
'Ì' => 'Ì',
'Ë' => 'Ë',
'Ê' => 'Ê',
'É' => 'É',
'È' => 'È',
'Ç' => 'Ç',
'Æ' => 'Æ',
'Å' => 'Å',
'Ä' => 'Ä',
'Ã' => 'Ã',
'Â' => 'Â',
'Á' => 'Á',
'À' => 'À',
'¿' => '¿',
'µ' => 'µ',
'±' => '±',
'°' => '°',
'®' => '®',
'©' => '©',
'¨' => '¨',
'§' => '§',
'¥' => '¥',
'£' => '£',
'€' => '€',
'¢' => '¢',
'¡' => '¡',
'’' => "'",
'‘' => "'",
'“' => '"',
'”' => '"',
'…' => '...',
"'" => '’'
);
$text=str_replace(array_keys($chars),array_values($chars),$text);
return $text;
}
Oz
-
Apr 28, 2005, 22:44 #14
- Join Date
- Feb 2001
- Location
- Melbourne Australia
- Posts
- 6,282
To filter out all non-ASCII characters:
PHP Code:
$text = preg_replace('/[^\x09\x0A\x0D\x20-\x7E]+/', '', $text);
If you're using ISO-8859-1 and want to allow the extended characters in the 0xC0 to 0xFF range (mostly accented letters), you can use this instead:
PHP Code:
$text = preg_replace('/[^\x09\x0A\x0D\x20-\x7E\xC0-\xFF]+/', '', $text);
[mmj] My magic jigsaw
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The Bit Depth Blog · Twitter · Contact me
Neon Javascript Framework · Jokes · Android stuff
-
Apr 29, 2005, 10:20 #15
- Join Date
- Nov 2001
- Location
- Phoenix, AZ, USA
- Posts
- 406
Won't both of those just get rid of the character completely, rather than replace it?
Oz
-
Apr 29, 2005, 23:17 #16
- Join Date
- Feb 2001
- Location
- Melbourne Australia
- Posts
- 6,282
Yes, that's right. So you can use that as a last step, after you have tried to convert the characters to their equivalents in the correct character encoding.
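The "convert first, filter last" order can be sketched roughly as follows. This assumes windows-1252 input and an ASCII-only feed. The function name clean_for_feed and the short replacement map are made up for illustration, while the pattern is the ASCII filter posted earlier in the thread.

```php
<?php
// Hypothetical wrapper: convert the characters we know about, then strip
// whatever is left that plain ASCII cannot represent.
function clean_for_feed(string $text): string
{
    // Step 1: map known windows-1252 bytes to ASCII equivalents.
    $text = strtr($text, [
        "\x91" => "'",   // left single quote
        "\x92" => "'",   // right single quote
        "\x93" => '"',   // left double quote
        "\x94" => '"',   // right double quote
        "\x85" => '...', // ellipsis
    ]);

    // Step 2: last resort, remove anything that is not tab, LF, CR or
    // printable ASCII.
    return (string) preg_replace('/[^\x09\x0A\x0D\x20-\x7E]+/', '', $text);
}
```

Running the filter first would delete the smart quotes before they could be converted, which is why it goes last.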