Old Mar 12, 2004, 02:57   #1
worchyld
SitePoint Guru
 
Join Date: Jul 2003
Location: Newcastle upon Tyne
Posts: 930
Making my RSS feed work with special characters

Hi there.

I'm using the following code to create the RSS feed on my site (please note I've edited it in places). My RSS feed is valid; however, it breaks when the content contains pound symbols (£), dollar symbols ($) or, indeed, any special character (such as %).

How do I get my PHP-powered RSS feed to work around the special characters problem?

I appreciate any help you can give on this subject.

Here is the code:

Code:
<?php
$pubDate = date("r");
$year = date("Y");

function iso_8601 ($txt_date) { 
	$fDate = strtotime($txt_date);
	$main_date = date("Y-m-d\TH:i:s", $fDate); 
	// Use the parsed timestamp here too ($timestamp was never defined)
	$tz = date("O", $fDate); 
	// Turn "+0000" into "+00:00"
	$tz = substr_replace ($tz, ':', 3, 0); 
	$return = $main_date . $tz; 
	return $return; 
} // end function

header ("Content-type: text/xml");
echo ("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
?>
<rdf:RDF xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:h="http://www.w3.org/1999/xhtml" xmlns:hr="http://www.w3.org/2000/08/w3c-synd/#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns="http://purl.org/rss/1.0/">
    <channel rdf:about="http://www.somesite.com/index.php">
        <title>TITLE</title>
        <description>DESCRIPTION</description>
        <link>LINK</link>
	<language>en-gb</language>
	<?php
	echo ("<pubDate>$pubDate</pubDate>");
	echo ("<copyright>Copyright $year. somesite.com</copyright>");
	echo ("<webMaster>EMAIL_ADDRESS</webMaster>");
	?>
    </channel>

<?php
while ($row = mysql_fetch_array ($result)) {
	$postid = $row['post_id'];
	$txt_date = $row['post_date'];
	$txt_title = $row['post_headline'];
	$txt_article = $row['post_article'];

	$txt_title = stripslashes($txt_title);
	$txt_title = fixDisplay($txt_title);
	$txt_article = fixDisplay($txt_article);
	$txt_article = strip_tags($txt_article);

	$formatted = iso_8601($txt_date);
	$articleLink = 'http://www.somesite.com/archives'.'/'.$postid;

	// DO RSS DISPLAY
	echo ("<item rdf:about=\"http://www.somesite.com\">
	<title>");
	echo $txt_title;
	echo ("</title>
	<description>");
	echo $txt_article;
	echo ("</description>
	<link>");
	echo $articleLink;
	echo ("</link>
	<dc:date>");
	echo $formatted;
	echo ("</dc:date>
	</item>\n\n");
}
mysql_free_result ($result);
mysql_close($connection);
?></rdf:RDF>
Old Mar 12, 2004, 05:21   #2
Caged
SitePoint Zealot
 
Join Date: May 2003
Location: United States
Posts: 108
Is it the PHP that's giving you the error, or the attempt to view the RSS from the browser?

You can use:
Code:
<element><![CDATA[Non Xml Content Here#$%@  ]]></element>
CDATA is used to mark parts that you don't want parsed as XML. Assuming we're both on the same page.
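
The CDATA idea above could be sketched in PHP roughly like this. The helper name cdataWrap is made up for illustration and is not part of the original feed code. Note that CDATA only protects markup characters such as & and <; it does not change the byte encoding, so on its own it would probably not cure a £ that is stored in a different character set from the one the feed declares.

```php
<?php
// Hypothetical helper: wrap arbitrary text in a CDATA section.
// The sequence "]]>" would end the section early, so it is split
// across two adjacent CDATA sections.
function cdataWrap(string $text): string
{
    return '<![CDATA[' . str_replace(']]>', ']]]]><![CDATA[>', $text) . ']]>';
}

// Example: markup characters pass through without escaping.
echo '<description>' . cdataWrap('Fish & chips <b>today</b>') . "</description>\n";
```

In the feed loop this would replace the bare echo of the title and article text.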
Old Mar 12, 2004, 06:51   #3
worchyld
SitePoint Guru
 
Join Date: Jul 2003
Location: Newcastle upon Tyne
Posts: 930
Sorry, I wasn't very clear in my last post about where the error was coming from.

The error comes from actually displaying special characters such as a pound symbol.

Example:
Quote:
Title:
Woman wins £3,000,000 on UK Lottery

Article:
Today a woman won £3.2 million as part of a syndicate. Her winnings were 5% of what the syndicate won.
Now, using the above code - replace:
$txt_title with the title example I've used above.

Replace:
$txt_article with the article example I've used above.

When you run the PHP code it complains that an error has been caused at line XX - it's to do with the £ symbol; it cannot handle them in an XML format.

Is there something I can do that'll help?
Old Mar 15, 2004, 05:23   #4
worchyld
SitePoint Guru
 
Join Date: Jul 2003
Location: Newcastle upon Tyne
Posts: 930
Perhaps someone has a different PHP-to-RSS script that a) produces valid RSS and b) can display £ and other special characters / symbols without breaking?
Old Mar 15, 2004, 05:41   #5
Chillijam
SitePoint Addict
 
Join Date: Nov 2003
Location: England
Posts: 293
Assuming you are only interested in displaying the RSS feed results through a browser, how about
PHP Code:

str_replace("£","&pound;",$txt_article); 

when creating the feed?
__________________
Your mind is like a parachute. It works best when open.
(HH The Dalai Lama)
Old Mar 15, 2004, 10:46   #6
Zoef
Ceci n'est pas Zoef
 
Join Date: Nov 2002
Location: Malta
Posts: 1,112
Quote:
Originally Posted by Chillijam
Assuming you are only interested in displaying the RSS feed results through a browser, how about
PHP Code:

str_replace("£","&pound;",$txt_article); 

when creating the feed?
Or maybe even better, use htmlentities.

Rik
__________________
English tea - Italian coffee - Maltese wine - Belgian beer - French Cognac
Old Mar 16, 2004, 02:06   #7
worchyld
SitePoint Guru
 
Join Date: Jul 2003
Location: Newcastle upon Tyne
Posts: 930
Ah, I never thought of that!

--- a few minutes later ---

Unfortunately it keeps coming up with an error relating to the &pound; symbol too...

Actual error message:

Quote:
The XML page cannot be displayed
Cannot view XML input using XSL style sheet. Please correct the error and then click the Refresh button, or try again later.


--------------------------------------------------------------------------------

Reference to undefined entity 'pound'. Error processing resource '[MYDOMAN.COM]/rss.php'. Line 11, Position 21

<title>Woman wins &pound;1,000,000 on UK lottery</title>
--------------------^
I've tried leaving the pound symbol in, using the str_replace you suggested, and I've just tried htmlentities - all produce the same error.

Is there a way around this?

Thanks for helping.
Old Mar 16, 2004, 02:40   #8
mmj
Test cases complete. 0 fails.
 
Join Date: Feb 2001
Location: Melbourne Australia
Posts: 6,569
To answer the original question,

You have specified a character set of UTF-8. You have to be careful when doing this, because if the document contains byte sequences that aren't valid UTF-8, a conforming XML parser will stop with an error. (The XML specification treats this as a fatal error; a lenient reader may instead skip the bad bytes or replace them with a question mark, but you can't rely on that.) You should ensure that the document is free of invalid characters before anyone needs to parse it.

Was the original document UTF-8? If the original was ISO-8859-1, then change the charset of this feed to match. A £ stored as ISO-8859-1 is the single byte 0xA3, which is not a valid UTF-8 sequence on its own, and that is enough to trip the parser. If you have absolutely no idea what character set you were using then it may have been ISO-8859-1 (but it may also contain some characters that aren't valid in ISO-8859-1).

If you have absolutely no idea and you have characters that are not valid either in UTF-8 or in ISO-8859-1, then you will have to bite the bullet and just filter out all non-ASCII characters (keeping tabs and line breaks). Do this:

$output = preg_replace('/[^\x09\x0A\x0D\x20-\x7E]+/', '', $output);

For more information about character sets, I recommend you read the Unicode FAQ (search for it).
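
If you would rather keep the feed declared as UTF-8 than change the declaration, the check-and-convert step described above might look something like this. It is only a sketch: the function name toUtf8 is hypothetical, it needs the mbstring extension, and it assumes that anything which is not already valid UTF-8 is ISO-8859-1.

```php
<?php
// Hypothetical helper: return $text as valid UTF-8.
// Assumption: input that is not already valid UTF-8 is ISO-8859-1.
function toUtf8(string $text): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text; // already valid, leave untouched
    }
    return mb_convert_encoding($text, 'UTF-8', 'ISO-8859-1');
}

// "\xA3" is the pound sign as a single ISO-8859-1 byte.
$title = toUtf8("Woman wins \xA33,000,000 on UK Lottery");
```

One caveat: text pasted from Windows applications is often Windows-1252 rather than true ISO-8859-1, in which case curly quotes and the euro sign would come out as control characters and a different source charset would be needed.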

Quote:
&pound;
No!
Quote:
Or maybe even better, use htmlentities
No! NEVER use any HTML entities in your XML files. I repeat: never use any HTML entities in your XML files. XML only predefines these entities:

&amp; &lt; &gt; &quot; &apos;

Other HTML entities will NOT work in an XML document unless the reader is non-compliant (broken) or the XML file is in a format whose DTD declares them (RSS is not such a format). HTML entities other than these should ONLY be used in HTML documents, not in XML, or RSS, or plain text, or anything else.
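
To escape only what XML itself defines, something along these lines should do. It is a sketch rather than the one true way: the name xmlEscape is made up, the ENT_XML1 flag needs PHP 5.4 or later, and the input is assumed to be valid UTF-8 already.

```php
<?php
// Hypothetical helper: escape text for XML element content or attribute values.
// ENT_XML1 restricts the output to XML's predefined entities;
// ENT_QUOTES escapes single quotes as well as double quotes.
function xmlEscape(string $text): string
{
    return htmlspecialchars($text, ENT_QUOTES | ENT_XML1, 'UTF-8');
}

// The pound sign ("\xC2\xA3" in UTF-8) is left alone; only & < > and quotes change.
echo '<title>' . xmlEscape("Fish & chips for \xC2\xA33") . "</title>\n";
```

Be aware that htmlspecialchars returns an empty string when the input is not valid in the given encoding (unless ENT_SUBSTITUTE or ENT_IGNORE is added), so the charset has to be sorted out first.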

----------
General notes about character sets:

Anybody who builds online applications should be aware of character sets. You should pick one character set, and everything should stick to this character set, because translating it is a hassle. I use UTF-8 for all data in the application I'm building.

If you don't specify a character set for your output, then you are relying on the browser (or whatever is reading your output) happening to have the same default character set as your application, which it might not. For instance, when a browser submits a form by POST, it normally sends the data in the same character set as the page containing the form. If the page doesn't declare one, then the server has no way of knowing what character set the data it receives will be in. UTF-8 is better than ISO-8859-1 in most ways, because it can represent many thousands more characters, covering virtually every language in use, in the one character set.
__________________
[mmj] My momentous journey
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The Bit Depth Blog · Twitter · Contact me
Spuds Jokes Bazaar VCS Inkscape Firefox phpBB
Old Mar 16, 2004, 04:30   #9
worchyld
SitePoint Guru
 
Join Date: Jul 2003
Location: Newcastle upon Tyne
Posts: 930
Thanks, mmj - that tutorial proved useful and enlightening. I'm not that hot on RSS feeds, and I'm only doing it because everybody else seems to have an RSS feed, but I do not know anyone who actually uses one on a day-to-day basis.

Thanks again mmj, I shall investigate UTF vs ISO some more and experiment.
Old Mar 16, 2004, 04:37   #10
worchyld
SitePoint Guru
 
Join Date: Jul 2003
Location: Newcastle upon Tyne
Posts: 930
Hey, you were right - it was the UTF vs ISO thing - I've changed it so instead of it saying:

Code:
echo ("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
It now says:

Code:
echo ("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\n");
The RSS feed works now, and it validates too!

Yey!
Old Mar 16, 2004, 06:20   #11
Hulkur
SitePoint Zealot
 
Join Date: Oct 2001
Location: Estonia
Posts: 141
I use this function, taken from the user comments on the htmlentities manual page:
Code:
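function xmlentities($string, $quote_style=ENT_COMPAT)
{
   // Ask for the ISO-8859-1 table explicitly: ord() reads a single byte,
   // so this only works for single-byte input (newer PHP defaults to UTF-8).
   $trans = get_html_translation_table(HTML_ENTITIES, $quote_style, 'ISO-8859-1');

   // Swap each named entity for its numeric character reference
   foreach ($trans as $key => $value)
       $trans[$key] = '&#'.ord($key).';';

   return strtr($string, $trans);
}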
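
For UTF-8 input, a variant of the same idea could be sketched with mbstring instead of a translation table. This is an assumption-laden alternative, not the function from the manual comments: the name xmlNumericEntities is invented here, and it relies on the mbstring extension and PHP 5.4+ for ENT_XML1.

```php
<?php
// Hypothetical variant: assumes valid UTF-8 input and the mbstring extension.
function xmlNumericEntities(string $text): string
{
    // Escape XML's own markup characters first (&, <, >, quotes).
    $escaped = htmlspecialchars($text, ENT_QUOTES | ENT_XML1, 'UTF-8');

    // Then turn every code point from U+0080 upwards into a decimal
    // character reference, which any XML parser understands.
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($escaped, $map, 'UTF-8');
}

// "\xC2\xA3" is the pound sign in UTF-8.
$title = xmlNumericEntities("Woman wins \xC2\xA33,000,000 & more");
```

Because the result is plain ASCII, it should stay valid whichever encoding the feed declares.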
__________________
(2B) or (not 2B) = FF
Old Mar 16, 2004, 11:32   #12
Zoef
Ceci n'est pas Zoef
 
Join Date: Nov 2002
Location: Malta
Posts: 1,112
Quote:
Originally Posted by mmj
No! NEVER use any HTML entities in your XML files. I repeat: never use any HTML entities in your XML files. XML only predefines these entities:

&amp; &lt; &gt; &quot; &apos;
I stand corrected!

I'm looking into XML and RSS with the idea of writing a decent reader/aggregator and I must say that it can all be rather confusing. I'm finding it hard to gather the information I need. There are the 'introductory articles', which are a dime a dozen. There are few 'practical guidelines' or 'best practice' articles out there that go further than the simplest stuff. Even the specs are ambiguous at best.

These are a few of the questions I'm struggling with:
  • What is the deal with CDATA? I'm seeing feeds that use it to 'embed' HTML within the description tags and I'm also seeing feeds that just have the HTML 'as is' in the description tags.
  • Which modules should a good reader support?
  • Can an RSS feed have more than one <image> or <textarea> in it?
  • Should an <item> element always be the last, or can other elements follow it?
  • With all of the above, what is the difference between versions?
I want to do this right. So if anyone has any answers to these questions, or can point me to some good resources, I'd be grateful. And btw, please don't tell me to google... I've been googling for the last 2 weeks.

Rik
__________________
English tea - Italian coffee - Maltese wine - Belgian beer - French Cognac
Old May 13, 2004, 18:46   #13
lesterix
SitePoint Member
 
Join Date: May 2004
Location: Belgium
Posts: 1
Quote:
What is the deal with CDATA ? I'm seeing feeds that use it to 'embed' html within the description tags and I'm also seeing feeds that just have the HTML 'as is' in the description tags.
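You can use CDATA to tell the parser to treat the characters in the section as plain text rather than markup.
This comes in handy when the text contains characters that have a special meaning in XML.
For example, a URL to a specific forum post in an RSS feed: "http://localhost/forum/index.php?showtopic=100&#entry504".
The "&#" in that URL will cause an error if you don't put it in a CDATA section, because the parser reads it as the start of a character reference. (The # on its own is harmless, and escaping the & as &amp; works too.)

This will render the URL correctly in XML:
Code:
 <![CDATA[http://localhost/forum/index.php?showtopic=100&#entry504]]>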
Old Oct 14, 2007, 10:42   #14
integral.india
SitePoint Member
 
Join Date: Oct 2007
Location: Everywhere
Posts: 3
The problem occurs at the stage of POST collection

I spent a lot of time solving this problem. I needed to POST XML data, which is UTF-8 encoded. I tried ISO-8859-1 as well, but had the same problem. I noticed that the POST data was truncated at the first occurrence of "&amp;".

In valid XML, "&" must be written as an entity. When you post such data using a form submitted through a browser, the entire data is URL-encoded. But when the same data was sent via the POST method from another application (in my case a VB program), the data was truncated, even though I set the form encoding to application/x-www-form-urlencoded.

Now I shall try reading the raw POST data using php://input.

Then that data must be URL-decoded and passed through html_entity_decode as well. I think that by accessing the raw POST data, it should work. For now, I have converted the special entity to a different substitute, as I need to finish the project.
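
A likely explanation for the truncation, offered here as a guess, is that the client declared application/x-www-form-urlencoded without actually URL-encoding the body, so the "&" in "&amp;" was read as a field separator. The sketch below reproduces that with parse_str; the helper name parseFormBody and the field name xml are hypothetical, and in a real script the body would come from file_get_contents('php://input').

```php
<?php
// Hypothetical helper: parse a raw application/x-www-form-urlencoded body.
// In a real script the body would come from file_get_contents('php://input').
function parseFormBody(string $rawBody): array
{
    $fields = [];
    parse_str($rawBody, $fields);
    return $fields;
}

$xml = '<title>Fish &amp; chips</title>';

// Body sent without URL-encoding: the "&" splits the value in two.
$broken = parseFormBody('xml=' . $xml);

// Body URL-encoded by the client: the value survives intact.
$intact = parseFormBody('xml=' . urlencode($xml));
```

If the sender cannot be changed, posting the XML as the whole request body (for example with a text/xml content type) and reading php://input directly would avoid form parsing altogether.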

Regards,