  1. #1
    SitePoint Guru worchyld's Avatar
    Join Date
    Jul 2003
    Location
    Newcastle upon Tyne
    Posts
    909

    Making my RSS feed work with special characters

    Hi there.

    I'm using the following code to create the RSS feed on my site (please note I've edited it in places). The feed validates, but it breaks when the content includes a pound symbol (£), a dollar symbol ($) or, indeed, any other special character (such as %).

    How do I get my PHP-powered RSS feed to work around the special characters problem?

    I appreciate any help you can give on this subject.

    Here is the code;

    Code:
    <?php
    $pubDate = date("r");
    $year = date("Y");
    
    function iso_8601 ($txt_date) { 
    	$fDate = strtotime($txt_date);
    	$main_date = date("Y-m-d\TH:i:s", $fDate); 
    	$tz = date("O", $fDate); // use the parsed timestamp; $timestamp was never defined
    	$tz = substr_replace ($tz, ':', 3, 0); // "+0100" becomes "+01:00"
    	$return = $main_date . $tz; 
    	return $return; 
    } // end function
    
    header ("Content-type: text/xml");
    echo ("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
    ?>
    <rdf:RDF xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:h="http://www.w3.org/1999/xhtml" xmlns:hr="http://www.w3.org/2000/08/w3c-synd/#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns="http://purl.org/rss/1.0/">
        <channel rdf:about="http://www.somesite.com/index.php">
            <title>TITLE</title>
            <description>DESCRIPTION</description>
            <link>LINK</link>
    	<language>en-gb</language>
    	<?php
    	echo ("<pubDate>$pubDate</pubDate>");
    	echo ("<copyright>Copyright $year. somesite.com</copyright>");
    	echo ("<webMaster>EMAIL_ADDRESS</webMaster>");
    	?>
        </channel>
    
    <?php
    while ($row = mysql_fetch_array ($result)) {
    	$postid = $row['post_id'];
    	$txt_date = $row['post_date'];
    	$txt_title = $row['post_headline'];
    	$txt_article = $row['post_article'];
    
    	$txt_title = stripslashes($txt_title);
    	$txt_title = fixDisplay($txt_title);
    	$txt_article = fixDisplay($txt_article);
    	$txt_article = strip_tags($txt_article);
    
    	$formatted = iso_8601($txt_date);
    	$articleLink = 'http://www.somesite.com/archives'.'/'.$postid;
    
    	// DO RSS DISPLAY
    	echo ("<item rdf:about=\"http://www.somesite.com\">
    	<title>");
    	echo $txt_title;
    	echo ("</title>
    	<description>");
    	echo $txt_article;
    	echo ("</description>
    	<link>");
    	echo $articleLink;
    	echo ("</link>
    	<dc:date>");
    	echo $formatted;
    	echo ("</dc:date>
    	</item>\n\n");
    }
    mysql_free_result ($result);
    mysql_close($connection);
    ?>
    </rdf:RDF>

  2. #2
    SitePoint Zealot Caged's Avatar
    Join Date
    May 2003
    Location
    United States
    Posts
    107
    Is it the PHP that's giving you the error, or the attempt to view the RSS in the browser?

    You can use:
    Code:
    <element><![CDATA[Non Xml Content Here#$%@  ]]></element>
    A CDATA section marks content that the parser should treat as literal text rather than XML markup. Assuming we're both on the same page.
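
    As a minimal sketch of the CDATA approach in PHP: the helper below (cdata_wrap is a made-up name, not a built-in, and the type hints assume PHP 7 or later) wraps a string in a CDATA section and splits any literal "]]>" so it cannot end the section early. CDATA only protects markup characters such as & and <; it would not help with a byte that is invalid in the encoding the feed declares.

    ```php
    <?php
    // Hypothetical helper: wrap arbitrary text in a CDATA section.
    // A literal "]]>" inside the text would close the section early,
    // so it is split across two adjacent CDATA sections.
    function cdata_wrap(string $text): string
    {
        return '<![CDATA[' . str_replace(']]>', ']]]]><![CDATA[>', $text) . ']]>';
    }

    // Example: markup characters pass through without being parsed as XML.
    echo '<description>' . cdata_wrap('5% off & more <today>') . "</description>\n";
    ```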

  3. #3
    SitePoint Guru worchyld's Avatar
    Join Date
    Jul 2003
    Location
    Newcastle upon Tyne
    Posts
    909
    Sorry, I wasn't very clear in my last post about where the error was coming from.

    The error is coming from actually displaying special characters such as a pound symbol.

    Example;
    Title:
    Woman wins £3,000,000 on UK Lottery

    Article:
    Today a woman won £3.2 million as part of a syndicate. Her winnings were 5% of what the syndicate won.
    Now, using the above code - replace:
    $txt_title with the title example I've used above.

    Replace:
    $txt_article with the article example I've used above.

    When you run the PHP code it complains that an error has been caused at line XX. It's to do with the symbols; it cannot handle them in an XML format.

    Is there something I can do that'll help?

  4. #4
    SitePoint Guru worchyld's Avatar
    Join Date
    Jul 2003
    Location
    Newcastle upon Tyne
    Posts
    909
    Perhaps someone has different PHP-to-RSS code that a) produces valid RSS and b) can display £ and other special characters / symbols without breaking?

  5. #5
    SitePoint Addict Chillijam's Avatar
    Join Date
    Nov 2003
    Location
    England
    Posts
    293
    Assuming you are only interested in displaying the RSS feed results through a browser, how about
    PHP Code:
    str_replace("£","&pound;",$txt_article); 
    when creating the feed?
    Your mind is like a parachute. It works best when open.
    (HH The Dalai Lama)

  6. #6
    Ceci n'est pas Zoef Zoef's Avatar
    Join Date
    Nov 2002
    Location
    Malta
    Posts
    1,111
    Quote Originally Posted by Chillijam
    Assuming you are only interested in displaying the RSS feed results through a browser, how about
    PHP Code:
    str_replace("£","&pound;",$txt_article); 
    when creating the feed?
    Or maybe even better, use htmlentities.

    Rik
    English tea - Italian coffee - Maltese wine - Belgian beer - French Cognac

  7. #7
    SitePoint Guru worchyld's Avatar
    Join Date
    Jul 2003
    Location
    Newcastle upon Tyne
    Posts
    909
    Ah, I never thought of that!

    --- a few minutes later ---

    Unfortunately it keeps coming up with an error relating to the &pound; entity too...

    Actual error message;

    The XML page cannot be displayed
    Cannot view XML input using XSL style sheet. Please correct the error and then click the Refresh button, or try again later.


    --------------------------------------------------------------------------------

    Reference to undefined entity 'pound'. Error processing resource '[MYDOMAN.COM]/rss.php'. Line 11, Position 21

    <title>Woman wins &pound;1,000,000 on UK lottery</title>
    --------------------^
    I've tried leaving the pound symbol in, I've tried the str_replace you suggested, and I've just tried htmlentities. All produce the same error.

    Is there a way around this?

    Thanks for helping.

  8. #8
    One website at a time mmj's Avatar
    Join Date
    Feb 2001
    Location
    Melbourne Australia
    Posts
    6,282
    To answer the original question,

    You have specified a character set of UTF-8. You have to be careful when doing this, because if you have characters in the document that aren't valid UTF-8 characters, then some XML parsers will die with an error. An XML parser must either ignore an invalid character, replace it with another (such as a question mark), or halt and display an error. You should ensure that the document is free of invalid characters before anyone needs to parse it.

    Was the original document UTF-8? If the original was ISO-8859-1, then change the charset of this feed to match. If you have absolutely no idea what character set you were using then it may have been ISO-8859-1 (but it may also contain some characters that aren't valid in ISO-8859-1).
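
    If the stored text turns out to be ISO-8859-1 and you would rather keep the feed declared as UTF-8, converting the text is another option. A minimal sketch, assuming the mbstring extension is loaded, PHP 7 or later, and that anything which is not already valid UTF-8 is ISO-8859-1 (to_utf8 is a made-up helper name):

    ```php
    <?php
    // Hypothetical helper: return $text as UTF-8.
    // Assumption: input that is not valid UTF-8 is ISO-8859-1 (Latin-1).
    function to_utf8(string $text): string
    {
        if (mb_check_encoding($text, 'UTF-8')) {
            return $text; // already valid UTF-8, leave untouched
        }
        return mb_convert_encoding($text, 'UTF-8', 'ISO-8859-1');
    }

    $latin1Pound = "\xA3";                      // "£" as one ISO-8859-1 byte
    echo bin2hex(to_utf8($latin1Pound)), "\n";  // c2a3, the UTF-8 bytes for "£"
    ```

    The guess in the else branch is only as good as the assumption behind it: text in some third encoding would be converted to the wrong characters without any error.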

    If you have absolutely no idea and you have characters that are not valid in either UTF-8 or ISO-8859-1, then you will have to bite the bullet and just filter out all non-ASCII characters (this pattern keeps tabs and line breaks so the text does not run together). Do this:

    $output = preg_replace('/[^\x09\x0A\x0D\x20-\x7E]+/', '', $output);

    For more information about character sets, I recommend you read the Unicode FAQ (do a google).

    Quote Originally Posted by Chillijam
    &pound;
    No!

    Quote Originally Posted by Zoef
    Or maybe even better, use htmlentities
    No! NEVER use any HTML entities in your XML files. I repeat: never use any HTML entities in your XML files. XML only predefines these entities:

    &amp; &lt; &gt; &quot; &apos;

    Other HTML entities will NOT work in an XML document unless the reader is non-compliant (broken) or the XML file is in a format which allows them (RSS is not such a format). HTML entities other than these should ONLY be used in HTML documents, not in XML, or RSS, or plain text, or anything else.
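
    One way to escape only the characters XML cares about is htmlspecialchars with the ENT_XML1 flag. A minimal sketch, assuming PHP 7 or later and that the text is already valid UTF-8 (xml_escape is a made-up wrapper name):

    ```php
    <?php
    // Hypothetical wrapper: escape &, <, > and both quote characters for XML.
    // Everything else, including "£", is left as-is, so the text must
    // already be in the encoding the feed declares (UTF-8 here).
    function xml_escape(string $text): string
    {
        return htmlspecialchars($text, ENT_XML1 | ENT_QUOTES, 'UTF-8');
    }

    $pound = "\xC2\xA3"; // "£" encoded as UTF-8
    echo '<title>' . xml_escape("Fish & chips <{$pound}5>") . "</title>\n";
    ```

    Unlike htmlentities, this never produces named entities such as &pound;, which an XML parser would reject.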

    ----------
    General notes about character sets:

    Anybody who builds online applications should be aware of character sets. You should pick one character set, and everything should stick to this character set, because translating it is a hassle. I use UTF-8 for all data in the application I'm building.

    If you don't specify a character set for your output, then you are relying on the browser (or whatever is reading your output) happening to have the same default character set as your application, which it might not. For instance, when a form is submitted via POST, the data is sent in the same character set as the page. If the page doesn't declare one, then the server has no way of knowing what character set the data it receives will be in. UTF-8 is better than ISO-8859-1 in most ways, because it can represent many thousands more characters, covering virtually every written language, in the one character set.
    [mmj] My magic jigsaw

  9. #9
    SitePoint Guru worchyld's Avatar
    Join Date
    Jul 2003
    Location
    Newcastle upon Tyne
    Posts
    909
    Thanks, mmj - that tutorial proved useful and enlightening. I'm not that hot on RSS feed, and I'm only doing it because everybody else seems to have a RSS feed, but I do not know anyone who actually uses one on a day-to-day basis.

    Thanks again mmj, I shall investigate UTF vs ISO some more and experiment.

  10. #10
    SitePoint Guru worchyld's Avatar
    Join Date
    Jul 2003
    Location
    Newcastle upon Tyne
    Posts
    909
    Hey, you were right - it was the UTF vs ISO thing - I've changed it so instead of it saying:

    Code:
    echo ("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
    It now says:

    Code:
    echo ("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\n");
    The RSS feed works now, and it validates too!

    Yey!

  11. #11
    SitePoint Zealot Hulkur's Avatar
    Join Date
    Oct 2001
    Location
    Estonia
    Posts
    141
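    I use this function, taken from the user comments on the htmlentities manual page:
    Code:
    function xmlentities($string, $quote_style=ENT_COMPAT)
    {
       // Assumes $string is ISO-8859-1: ord() reads a single byte, so the
       // table is requested in that encoding (third argument, PHP 5.3.4+).
       $trans = get_html_translation_table(HTML_ENTITIES, $quote_style, 'ISO-8859-1');
    
       // Swap each named entity for a numeric character reference,
       // which any XML parser accepts.
       foreach ($trans as $key => $value)
           $trans[$key] = '&#'.ord($key).';';
    
       return strtr($string, $trans);
    }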
    (2B) or (not 2B) = FF
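
    For UTF-8 input the same idea can be sketched with mb_encode_numericentity instead of ord(). This assumes the mbstring extension, PHP 7 or later, and valid UTF-8 input; to_numeric_refs is a made-up name:

    ```php
    <?php
    // Hypothetical helper: escape markup characters, then turn every
    // code point above ASCII into a numeric character reference.
    function to_numeric_refs(string $utf8): string
    {
        $escaped = htmlspecialchars($utf8, ENT_XML1 | ENT_QUOTES, 'UTF-8');
        // Conversion map: [first code point, last code point, offset, mask]
        $convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
        return mb_encode_numericentity($escaped, $convmap, 'UTF-8');
    }

    $pound = "\xC2\xA3"; // "£" encoded as UTF-8
    echo to_numeric_refs("Woman wins {$pound}1,000,000 & more"), "\n";
    ```

    The result is pure ASCII, so it parses the same way whichever encoding the XML declaration names.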

  12. #12
    Ceci n'est pas Zoef Zoef's Avatar
    Join Date
    Nov 2002
    Location
    Malta
    Posts
    1,111
    Quote Originally Posted by mmj
    No! NEVER use any HTML entities in your XML files. I repeat: never use any HTML entities in your XML files. XML only predefines these entities:

    &amp; &lt; &gt; &quot; &apos;
    I stand corrected!

    I'm looking into XML and RSS with the idea of writing a decent reader/aggregator and I must say that it can all be rather confusing. I'm finding it hard to gather the information I need. There are the 'introductory articles', which are a dime a dozen. There are few 'practical guidelines' or 'best practice' articles out there that go any further than the simplest stuff. Even the specs are ambiguous at best.

    These are a few of the questions I'm struggling with:
    • What is the deal with CDATA? I'm seeing feeds that use it to 'embed' HTML within the description tags and I'm also seeing feeds that just have the HTML 'as is' in the description tags.
    • Which modules should a good reader support?
    • Can an RSS feed have more than one <image> or <textinput> in it?
    • Should an <item> element always be the last, or can other elements follow it?
    • For all of the above, what are the differences between RSS versions?

    I want to do this right. So if anyone has any answers to these questions, or can point me to some good resources, I'd be grateful. And btw, please don't tell me to google... I've been googling for the last 2 weeks.

    Rik
    English tea - Italian coffee - Maltese wine - Belgian beer - French Cognac

  13. #13
    SitePoint Member
    Join Date
    May 2004
    Location
    Belgium
    Posts
    1
    Quote Originally Posted by Zoef
    What is the deal with CDATA ? I'm seeing feeds that use it to 'embed' html within the description tags and I'm also seeing feeds that just have the HTML 'as is' in the description tags.
    You can use CDATA to tell the parser to treat the characters in that section as literal text rather than markup.
    This comes in handy when the text contains characters that have a special meaning in XML.
    For example, a URL to a specific forum post in an RSS feed: "http://localhost/forum/index.php?showtopic=100&#entry504".
    The ampersand (&) starts an entity or character reference, so "&#entry504" will cause a parse error unless you escape the & as &amp; or put the URL in a CDATA section.

    This will carry the URL through unchanged:
    Code:
     <![CDATA[http://localhost/forum/index.php?showtopic=100&#entry504]]>

  14. #14
    SitePoint Member
    Join Date
    Oct 2007
    Location
    Everywhere
    Posts
    3

    The problem occurs at the stage of Post collection

    I spent a lot of time trying to solve this problem. I needed to POST XML data that is UTF-8 encoded. I also tried ISO-8859-1, with the same result: the POST data was truncated at the first occurrence of "&amp;".

    In valid XML a literal "&" must be written as an entity. When you post that data from a form submitted through a browser, the entire body is URL-encoded. But when the same data was sent via POST from another application (in my case a VB program), it was truncated, even though I set the form encoding to application/x-www-form-urlencoded.

    Now I shall try reading the raw POST data using php://input.

    That data will then need to be passed through urldecode and html_entity_decode as well. I think it should work once I read the raw POST data. For now, I have replaced the entity with a different substitute, as I need to finish the project.

    Regards,
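
    The truncation described above can be reproduced with parse_str, which splits a body on "&" in the same way PHP does when it fills $_POST from a form-encoded request. A minimal sketch, where the field name xml is made up for illustration:

    ```php
    <?php
    $xml = '<title>Fish &amp; chips</title>';

    // Unencoded body: fields are split on "&", so the value stops there.
    parse_str('xml=' . $xml, $raw);

    // URL-encoded body: "&" travels as %26 and is restored on decode.
    parse_str('xml=' . urlencode($xml), $encoded);

    var_dump($raw['xml'] === $xml);     // bool(false): value was cut at "&"
    var_dump($encoded['xml'] === $xml); // bool(true)

    // In a real request handler the untouched body is available with:
    // $body = file_get_contents('php://input');
    ```

    Setting the Content-Type header to application/x-www-form-urlencoded only labels the body; the sending program still has to URL-encode the value itself.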

