PHP and XML: Parsing RSS 1.0

XML is springing up all over the Internet as a means to create standard data formats for the exchange of information between systems, regardless of their platform or technology. As you may already know, XML allows you to define your own custom markup languages, similar to HTML and suited to whatever data you need to represent. A number of standard XML-based markup languages have been created to facilitate the exchange of common types of information. In this article, we’ll learn how to use PHP to read an XML document and display the data it contains as a Web page. The example we’ll use is a Resource Description Framework (RDF) Site Summary (RSS) 1.0 document, although the techniques presented here apply to any situation where you wish to parse XML data in a PHP script.

Key Takeaways

XML is increasingly used on the Internet for data exchange across different systems, and PHP can effectively parse XML data for web presentation.
RSS 1.0, an XML-based format, facilitates the distribution and syndication of website content as a ‘channel’, which can be processed using PHP.
PHP utilizes an event-based API for XML parsing, which involves setting up handlers for start tags, end tags, and character data to efficiently manage XML content.
The article provides a practical example of parsing an RSS 1.0 feed using PHP, with detailed code snippets and explanations for handling various XML elements.
Techniques discussed include creating an XML parser in PHP, setting up event handlers, and reading XML data in chunks to prevent memory overload.
The tutorial also offers an alternative object-oriented approach to handle XML parsing, which encapsulates the parsing logic within a class to avoid the use of global variables.

A Brief Tour Of RSS 1.0

RSS originally stood for Rich Site Summary, a format developed by Netscape; it now refers to RDF Site Summary, an updated, XML-compliant version of the Netscape technology. It is an XML document format intended to describe, summarize, and distribute the contents of a Web site as a ‘channel’. Sites such as MoreOver.com and O’Reilly’s Meerkat process RSS feeds provided by news and other content sites and provide combined headline newsfeed services. RSS is currently developed by the RSS-DEV Working Group.

As with most XML document formats, the meaning of the document can be gleaned fairly easily simply by looking over a sample document. SitePoint.com provides summaries of its front-page articles in RSS format at https://www.sitepoint.com/rss.php. If you are using Internet Explorer 5 or later, you can view the current version of this XML document directly in your browser. For everyone else, here is the current SitePoint.com RSS file at the time of this writing:

<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns="http://purl.org/rss/1.0/">

<channel rdf:about="https://www.sitepoint.com/rss.php">
<title>SitePoint.com</title>
<description>Master the Web!</description>
<link>https://www.sitepoint.com/</link>
</channel>

<item rdf:about="http://www.PromotionBase.com/article/551">
<title>Escape Search Engine Caching</title>
<description>Did you know that many search engines cache your pages?
While this practice can speed up a search, users might not see your
most recent site updates! Ralph shows how you can stop search engines
caching your pages.</description>
<link>http://www.PromotionBase.com/article/551</link>
</item>

<item rdf:about="http://www.WebmasterBase.com/article/541">
<title>Add JavaScript to Fireworks</title>
<description>Does your design need more pizazz? Add interactivity to
your site without learning JavaScript! Matt explains the creation of
JavaScript effects in Fireworks, and explores in detail the use of
this program’s tools.</description>
<link>http://www.WebmasterBase.com/article/541</link>
</item>

<item rdf:about="http://www.eCommerceBase.com/article/552">
<title>eMail Campaigns in 8 Steps – Part 2</title>
<description>Ok, so you’ve reeled in your prospects and they’re on
your mailing list. Now what? How do you communicate effectively, and
turn them into customers? Jason reveals all…</description>
<link>http://www.eCommerceBase.com/article/552</link>
</item>

<item rdf:about="http://www.eCommerceBase.com/article/505">
<title>The Need for a Written Website Contract</title>
<description>A written agreement is essential if you pay others to
design, build or maintain your Websites. Ivan explains the necessity
of contracts to those who work on the Web.</description>
<link>http://www.eCommerceBase.com/article/505</link>
</item>

<item rdf:about="http://www.PromotionBase.com/article/556">
<title>Search Engine Strategies 2001 – Conference Report</title>
<description>Sinewave Interactive’s Gavin Appel talks to Matt about
this year’s Search Engine Strategies conference. He outlines the
discussions and predictions of industry leaders.</description>
<link>http://www.PromotionBase.com/article/556</link>
</item>

<item rdf:about="http://www.eCommerceBase.com/article/508">
<title>Better eCommerce Questionnaire</title>
<description>Overhaul your ecommerce strategy now! Face up to the
tough questions with Lee, as he guides you through a simple process
to optimize your ecommerce strategy.</description>
<link>http://www.eCommerceBase.com/article/508</link>
</item>

</rdf:RDF>

As you can see, the file begins with a <channel> tag that contains the title, description, and URL of the site that the RSS file describes, as well as a list of the <items> that the channel currently contains. This tag is then followed by an <item> tag for each of the articles that appear on the front page of SitePoint.com. For each, the title, description, and URL are provided. It should be noted that this is a bare-bones RSS file. Many sites make use of standard extensions to the RSS format to include things like author names, images, and publication dates for the items in their channel, but for the purposes of this article this basic RSS file will do.

Most Web browsers can’t read XML pages, and those that can display only the code of the page (Internet Explorer 5+) or its textual portions (Netscape 6+) by default. If you want to display this RSS document to users, you therefore need some intermediate technology to convert it into something presentable. Other possible uses include reading the file and storing the headlines in a database, or emailing subscribed users when particular keywords appear in the descriptions of new articles. In any case, you’re going to need something that can read XML. Of the many options available in this arena, this article will examine the use of PHP to parse an XML document.

PHP’s Take on XML

There are two widely-used methods for programming languages to read XML documents: event-based APIs and Document Object Model (DOM) APIs. In the latter class of APIs, XML documents are read into memory in their entirety and can then be manipulated through a set of functions that provide access to an object oriented model of the document (the DOM) in memory. DOM APIs are generally considered to be more powerful; however, they suffer from one serious drawback: they are ill-suited to processing large XML documents, which would take too much memory to build the model of the document.

PHP uses an event-based API to process XML. In such models, the XML document is read in from beginning to end, setting off an event whenever a start tag, end tag, or block of character data is encountered. Each of these events causes a function of the programmer’s choice to be called. Thus, reading an XML document with an event-based API like that of PHP is simply a matter of writing the functions to react appropriately to the events that occur as PHP moves through the document.

Here’s the basic code for setting up event-handling functions and parsing (reading in) an XML document in PHP:

// Create an XML parser
$xml_parser = xml_parser_create();

// Set the functions to handle opening and closing tags
xml_set_element_handler($xml_parser, "startElement", "endElement");

// Set the function to handle blocks of character data
xml_set_character_data_handler($xml_parser, "characterData");

// Open the XML file for reading
$fp = fopen("https://www.sitepoint.com/rss.php","r")
    or die("Error reading RSS data.");

// Read the XML file 4KB at a time
while ($data = fread($fp, 4096))
    // Parse each 4KB chunk with the XML parser created above
    xml_parse($xml_parser, $data, feof($fp))
        // Handle errors in parsing
        or die(sprintf("XML error: %s at line %d",
            xml_error_string(xml_get_error_code($xml_parser)),
            xml_get_current_line_number($xml_parser)));

// Close the XML file
fclose($fp);

// Free up memory used by the XML parser
xml_parser_free($xml_parser);

Each line of the above code is commented to explain what it does, but let’s look at the XML-related PHP functions it uses in a little more detail:

xml_parser_create(): Creates an XML parser. Just as you must create a database connection in PHP if you want to interact with a database, you must create an XML parser to use when you want to read in an XML file. In the above example, a reference to the parser is stored in $xml_parser.
xml_set_element_handler(parser, startElementFunction, endElementFunction): This function specifies the functions that an XML parser should use to process the events generated by opening and closing tags. In this case, the parser is the one stored in our $xml_parser variable, while the functions are called startElement and endElement. These functions will be defined elsewhere in the PHP script (I’ll give an example below).
xml_set_character_data_handler(parser, characterDataFunction): This function specifies the function that the XML parser should use to process character data appearing between tags in an XML document. Once again we use our $xml_parser variable in the example above. The function we choose to process character data is called characterData.
xml_parse(parser, data, endOfDocument): This function sends all or part of an XML document to the parser for it to process. The endOfDocument parameter should be set to true if the data marks the end of the XML document, or false if more of the document will follow in a subsequent call to xml_parse. This allows the parser to correctly catch unclosed tags at the end of the document and so forth. In our example, the parser is once again $xml_parser. The $data variable (up to 4KB in size) retrieved from the file with fread is passed as the data to be processed, while the feof function is used to determine whether PHP has reached the end of the XML file, thus providing the required endOfDocument parameter. If an error occurs in the parsing of the document, we print out the error message and the line of the file at which it occurs with xml_error_string, xml_get_error_code and xml_get_current_line_number, all of which are described in detail in the PHP manual if you’re curious.
xml_parser_free(parser): Although all memory resources are freed at the end of a PHP script, you may wish to free up the memory used by the XML parser if your script will perform other potentially memory-intensive tasks after it parses the XML data. This function destroys the specified XML parser, thus freeing up resources and memory it may have allocated for parsing.

There are a few additional functions that let you handle some of the more esoteric events that occur during XML parsing, but these are well documented in the PHP manual so I’ll leave them for you to read up on if your particular application requires them. For our purposes (reading an RSS file), we now have everything we need. All that’s left is to write the three event handling functions: startElement, endElement, and characterData.
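To see the order in which events fire without touching the network, the following minimal sketch feeds a small in-memory document to the parser in 16-byte chunks. The sample XML, the $events log, and the use of anonymous functions as handlers are assumptions made for illustration; the chunking simply stands in for the fread loop above.

```php
<?php
// Sketch only: the document and the $events log are made up for illustration.
$xml = '<feed><item id="1">Hello</item><item id="2">World</item></feed>';

$events = [];
$parser = xml_parser_create();

// Record every opening and closing tag (names arrive case-folded to uppercase)
xml_set_element_handler(
    $parser,
    function ($parser, $tagName, $attrs) use (&$events) {
        $events[] = "start:$tagName";
    },
    function ($parser, $tagName) use (&$events) {
        $events[] = "end:$tagName";
    }
);

// Record each block of character data; one text node may arrive in several pieces
xml_set_character_data_handler(
    $parser,
    function ($parser, $data) use (&$events) {
        $events[] = "text:$data";
    }
);

// Split the document into small chunks, as a fread() loop would
$chunks = str_split($xml, 16);
$last = count($chunks) - 1;

foreach ($chunks as $i => $chunk) {
    // The third argument is true only for the final chunk
    if (!xml_parse($parser, $chunk, $i === $last)) {
        throw new RuntimeException(sprintf(
            "XML error: %s at line %d",
            xml_error_string(xml_get_error_code($parser)),
            xml_get_current_line_number($parser)
        ));
    }
}

echo implode("\n", $events), "\n";
```

Because a chunk boundary can fall in the middle of a text node, the number of text events may differ from the number of text nodes in the document.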

Event Handlers for RSS Parsing

We have three functions to write. Each of these functions must take certain parameters. These parameters are dictated by PHP, since it is PHP’s XML parser that will call them. Here are the parameters that you must define for each of these functions:

startElement($parser, $tagName, $attrs)

$parser will be passed a reference to the XML parser that is being used to parse the document.
$tagName is the ALL-UPPERCASE (the PHP manual calls this ‘case-folded’) version of the name of the opening tag that triggered the event.
$attrs is an associative array of the attributes that are present in the tag that triggered the event. For example, if the tag <body bgcolor="#FFFFFF"> triggered the event, then the value of $attrs['BGCOLOR'] would be "#FFFFFF". Note that, like the tag name, attribute names are case-folded (all uppercase).

endElement($parser, $tagName)

$parser will be passed a reference to the XML parser that is being used to parse the document.
$tagName is the case-folded name of the closing tag that triggered the event.

characterData($parser, $data)

$parser will be passed a reference to the XML parser that is being used to parse the document.
$data is a string of text appearing between XML tags in the document. The text between two tags will not necessarily trigger a single event. Blocks of text spread over multiple lines will cause one event per line, with each event being passed the $data for that line.
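The case-folding behaviour described above can be checked with a short sketch. It parses the <body> example tag twice: once with default settings and once with the XML_OPTION_CASE_FOLDING option switched off through xml_parser_set_option. The helper function name collectStartTags is hypothetical.

```php
<?php
// Hypothetical helper: returns [tagName, attrs] pairs for every opening tag.
function collectStartTags($xml, $caseFolding)
{
    $seen = [];
    $parser = xml_parser_create();
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, $caseFolding);
    xml_set_element_handler(
        $parser,
        function ($parser, $tagName, $attrs) use (&$seen) {
            $seen[] = [$tagName, $attrs];
        },
        function ($parser, $tagName) {
            // Closing tags are ignored in this sketch
        }
    );
    xml_parse($parser, $xml, true);
    return $seen;
}

$sample = '<body bgcolor="#FFFFFF"></body>';

$folded = collectStartTags($sample, true);   // default behaviour
$asIs   = collectStartTags($sample, false);  // names left as written

print_r($folded);
print_r($asIs);
```

With folding on, the handler receives BODY and a BGCOLOR key; with it off, the names arrive as they were written in the document. The rest of this article relies on the default (folded) behaviour.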

With this in mind, the process of converting the XML data for SitePoint’s RSS file into a viewable HTML document may seem fairly straightforward at first glance. If you stop and try to work out what the three event handling functions should do, however, you’ll quickly realise that it’s not quite as simple as it seems. For those of you who may be feeling lost at this stage, don’t worry. Looking at definitions for these functions that will process SitePoint’s RSS file (or indeed any site’s RSS file) should help it all make sense.

The first complexity that may strike you is that the characterData function must react to text appearing between tags, but nothing is passed to the function to tell it which tags contain the text being processed. For this reason, most XML parsing scripts will need to define a set of global variables to track information received by one of the event-handling functions for use by the others.

In the case of our RSS file, all the information we need about the articles on SitePoint’s cover page is contained in the <item> tags in the document. So the first global variable we’ll define will be $insideitem, which we’ll set to true when entering an <item> tag and false when exiting one. We’ll also define four other variables, the purposes for which will become clear as we move along:

$insideitem = false;
$tag = "";
$title = "";
$description = "";
$link = "";

Let’s begin with startElement. This function will be called by the XML parser whenever an opening tag is encountered. Since we’re only really interested in what goes on between <item> tags, we’ll first check if we are indeed inside an <item> tag:

function startElement($parser, $tagName, $attrs) {
    global $insideitem, $tag;
    if ($insideitem) {

Note the global statement at the start of the function, which indicates that this function will need access to the $insideitem and $tag global variables. Now, if $insideitem is true, it means we’re going to want to take note of the tag that is starting so we know what to do with the character data it contains, which will trigger a call to characterData next. So we record the name of the tag ($tagName) in our global $tag variable:

$tag = $tagName;

If, on the other hand, we’re not inside an <item> tag, then the only opening tag that we could possibly be interested in would be an actual <item> tag, in which case we would set $insideitem to true to indicate that we were entering one of these tags:

    } elseif ($tagName == "ITEM") {
        $insideitem = true;
    }
}

Note that we are checking if $tagName is "ITEM", since tag names are case-folded to all uppercase.

That does it for opening tags. The next step in parsing our RSS document is handling the character data that appears between tags, and that’s the job of our characterData function:

function characterData($parser, $data) {
    global $insideitem, $tag, $title, $description, $link;

This function requires access to all five of our global variables, as we’ll see shortly. Now, as before, the only time we are interested in the character data in the XML file is when we are inside an <item> tag, so the first step again is to check if that is the case:

if ($insideitem) {

There are three different tags that can appear inside <item> tags that we are interested in: <title>, <description> and <link>. Since we want to display the title of each article above its description and with a link to the URL specified in the <link> tag, we can’t simply output the character data as it is encountered in the XML file. Instead, we need to collect all the data for each <item> tag and then print it all out at once. Our global $title, $description and $link variables will be used for this exact purpose. We will use a switch statement to determine which tag we are dealing with and store the $data in the corresponding variable. Recall that the name of the current tag is stored in the global $tag variable.

        switch ($tag) {
            case "TITLE":
                $title .= $data;
                break;
            case "DESCRIPTION":
                $description .= $data;
                break;
            case "LINK":
                $link .= $data;
                break;
        }
    }
}

Note that we append (.=) the $data to the variable in question, rather than simply assigning it (=) because the contents of a single tag can be received as several consecutive characterData events.
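The need for appending can be seen in a small sketch. It parses a single <description> element whose text spans three lines, counts how often the character data handler runs, and rebuilds the text by concatenation. The sample text and variable names are assumptions for illustration, and the exact number of calls depends on the underlying XML library.

```php
<?php
// Sketch only: sample text chosen to span several lines.
$text = "line one\nline two\nline three";
$xml  = "<description>$text</description>";

$calls  = 0;   // how many characterData events were received
$buffer = "";  // text rebuilt by appending each piece

$parser = xml_parser_create();
xml_set_character_data_handler(
    $parser,
    function ($parser, $data) use (&$calls, &$buffer) {
        $calls++;
        $buffer .= $data;  // append, never assign
    }
);

if (!xml_parse($parser, $xml, true)) {
    throw new RuntimeException(
        xml_error_string(xml_get_error_code($parser))
    );
}

printf("characterData was called %d time(s)\n", $calls);
echo $buffer, "\n";
```

Had the handler used plain assignment, only the last piece delivered would survive whenever the text arrives in more than one event.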

Once the character data for a tag has been processed, the next event to occur will call our endElement function to indicate the closing tag. In this application, the only tag that will require action on our part following its closing is the <item> tag. When the </item> tag is encountered, we will have retrieved all the $title, $description and $link data for the item, and so we can then output it as HTML:

function endElement($parser, $tagName) {
    global $insideitem, $tag, $title, $description, $link;
    if ($tagName == "ITEM") {
        printf("<a href='%s'>%s</a>",
            trim($link), htmlspecialchars(trim($title)));
        printf("%s", htmlspecialchars(trim($description)));

Feel free to use echo statements if you’re not used to the more convenient printf function I used above. In either case, once you’ve output the URL, title and description of the <item>, you can clear the global variables so that they’re ready to receive the character data for the next <item> in the document:

        $title = "";
        $description = "";
        $link = "";

And then finally set $insideitem to false to indicate to our other functions that we are no longer inside an <item> tag.

        $insideitem = false;
    }
}

That’s it! To see this script in action, click here. You can also see the complete source code (use the view source command in your browser if the source code isn’t displayed as a text file).

Another OOPtion

The more experienced programmers in the audience may have cringed as soon as I mentioned the use of global variables. There is a school of thought that global variables are merely a sign of lazy programming, and indeed PHP’s Object Oriented Programming (OOP) features provide a better option. Here’s an alternate version of the script we have just developed. Instead of functions to handle the XML document events, we use the methods of a PHP object. The data that these methods must share can then be stored as instance variables of the object, thus eliminating the need for global variables in our script.

class RSSParser {

    var $insideitem = false;
    var $tag = "";
    var $title = "";
    var $description = "";
    var $link = "";

    function startElement($parser, $tagName, $attrs) {
        if ($this->insideitem) {
            $this->tag = $tagName;
        } elseif ($tagName == "ITEM") {
            $this->insideitem = true;
        }
    }

    function endElement($parser, $tagName) {
        if ($tagName == "ITEM") {
            printf("<a href='%s'>%s</a>",
                trim($this->link), htmlspecialchars(trim($this->title)));
            printf("%s",
                htmlspecialchars(trim($this->description)));
            $this->title = "";
            $this->description = "";
            $this->link = "";
            $this->insideitem = false;
        }
    }

    // Same logic as the procedural characterData function,
    // using instance variables instead of globals
    function characterData($parser, $data) {
        if ($this->insideitem) {
            switch ($this->tag) {
                case "TITLE":
                    $this->title .= $data;
                    break;
                case "DESCRIPTION":
                    $this->description .= $data;
                    break;
                case "LINK":
                    $this->link .= $data;
                    break;
            }
        }
    }
}

$xml_parser = xml_parser_create();
$rss_parser = new RSSParser();
// The & is PHP 4 call-time pass-by-reference; PHP 5.4 and later reject it, so omit it there
xml_set_object($xml_parser,&$rss_parser);
xml_set_element_handler($xml_parser, "startElement", "endElement");
xml_set_character_data_handler($xml_parser, "characterData");
$fp = fopen("https://www.sitepoint.com/rss.php","r")
    or die("Error reading RSS data.");
while ($data = fread($fp, 4096))
    xml_parse($xml_parser, $data, feof($fp))
        or die(sprintf("XML error: %s at line %d",
            xml_error_string(xml_get_error_code($xml_parser)),
            xml_get_current_line_number($xml_parser)));
fclose($fp);
xml_parser_free($xml_parser);

If you’re familiar with object oriented programming in PHP, then the only line that may be a source of consternation for you is the following:

xml_set_object($xml_parser,&$rss_parser);

Since xml_set_element_handler and xml_set_character_data_handler cannot take references to the methods of objects as the function names of the event handlers (i.e. xml_set_element_handler($xml_parser,"$rss_parser->startElement",... won’t work), you need a way to tell the XML parser to call methods of the $rss_parser object instead of basic functions. xml_set_object does just that, taking the parser as well as a reference (note the &) to the object whose methods you want it to call. Be aware that this call-time pass-by-reference syntax was removed in PHP 5.4; on PHP 5.4 and later, pass $rss_parser without the &.
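On current PHP versions, the handler-setting functions also accept an array callable of the form [object, method name], which makes xml_set_object unnecessary. The sketch below shows that alternative; the TitleCollector class and the sample document are hypothetical and not part of the original script.

```php
<?php
// Hypothetical class that gathers the text of every <title> element.
class TitleCollector
{
    public $titles = [];
    private $insideTitle = false;
    private $current = "";

    public function startElement($parser, $tagName, $attrs)
    {
        if ($tagName === "TITLE") {
            $this->insideTitle = true;
            $this->current = "";
        }
    }

    public function endElement($parser, $tagName)
    {
        if ($tagName === "TITLE") {
            $this->titles[] = trim($this->current);
            $this->insideTitle = false;
        }
    }

    public function characterData($parser, $data)
    {
        if ($this->insideTitle) {
            $this->current .= $data;
        }
    }
}

$collector = new TitleCollector();
$parser = xml_parser_create();

// Array callables bind each handler to a method of $collector
xml_set_element_handler(
    $parser,
    [$collector, "startElement"],
    [$collector, "endElement"]
);
xml_set_character_data_handler($parser, [$collector, "characterData"]);

$xml = "<items><item><title>First</title></item>"
     . "<item><title>Second</title></item></items>";
xml_parse($parser, $xml, true);

print_r($collector->titles);
```

The handler methods are declared public so that the parser can call them from outside the class.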

Download the full code of this Object Oriented version here.

Summary and Resources for Further Reading

In this article we learned how to parse XML documents (taking the specific example of an RSS 1.0 document) using PHP. Although its event-based API for XML document processing can require some thought to determine how to process even a simple XML document structure, we saw that thinking of the events as they occur in the processing of the document can help to determine what each of the event handling methods needs to do.

In a fully-coded example, I demonstrated how to parse SitePoint’s RSS file, which lists the cover articles on SitePoint.com at any given time. Feel free to adapt this code for use on your site to list the current headlines on SitePoint.com in whatever format suits your site! For the programming purists out there, I provided an alternate version of the code that avoids the use of global variables by encapsulating the event handlers and the variables they must share inside a PHP class. With a little more work, you could encapsulate all of the XML processing code within the class and place that class in an include file, thus reducing the actual code in your document for displaying the SitePoint headlines to a two-line affair of creating your RSSParser object and then passing one of its methods the URL of SitePoint’s RSS file (https://www.sitepoint.com/rss.php).
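As a rough sketch of that encapsulation idea, the class below keeps the parser setup and the handlers together and returns the items as an array instead of printing them. The class name RSSItemReader and its parseString method are hypothetical, and the method takes an XML string rather than a URL so that the example runs without network access.

```php
<?php
// Hypothetical class: wraps parser creation, handlers and shared state.
class RSSItemReader
{
    private $insideItem = false;
    private $tag = "";
    private $current = [];
    private $items = [];

    // Parses a complete RSS document held in a string; returns the items found
    public function parseString($xml)
    {
        $this->insideItem = false;
        $this->tag = "";
        $this->items = [];

        $parser = xml_parser_create();
        xml_set_element_handler(
            $parser,
            [$this, "startElement"],
            [$this, "endElement"]
        );
        xml_set_character_data_handler($parser, [$this, "characterData"]);

        if (!xml_parse($parser, $xml, true)) {
            throw new RuntimeException(sprintf(
                "XML error: %s at line %d",
                xml_error_string(xml_get_error_code($parser)),
                xml_get_current_line_number($parser)
            ));
        }
        return $this->items;
    }

    public function startElement($parser, $tagName, $attrs)
    {
        if ($this->insideItem) {
            $this->tag = $tagName;
        } elseif ($tagName === "ITEM") {
            $this->insideItem = true;
            $this->current = ["TITLE" => "", "DESCRIPTION" => "", "LINK" => ""];
        }
    }

    public function endElement($parser, $tagName)
    {
        if ($tagName === "ITEM") {
            $this->items[] = array_map("trim", $this->current);
            $this->insideItem = false;
            $this->tag = "";
        }
    }

    public function characterData($parser, $data)
    {
        // Only collect text for the three tags we track
        if ($this->insideItem && isset($this->current[$this->tag])) {
            $this->current[$this->tag] .= $data;
        }
    }
}

$rss = <<<XML
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns="http://purl.org/rss/1.0/">
<channel rdf:about="https://www.sitepoint.com/rss.php">
<title>SitePoint.com</title>
</channel>
<item rdf:about="http://www.PromotionBase.com/article/551">
<title>Escape Search Engine Caching</title>
<link>http://www.PromotionBase.com/article/551</link>
</item>
<item rdf:about="http://www.WebmasterBase.com/article/541">
<title>Add JavaScript to Fireworks</title>
<link>http://www.WebmasterBase.com/article/541</link>
</item>
</rdf:RDF>
XML;

$reader = new RSSItemReader();
$items = $reader->parseString($rss);

foreach ($items as $item) {
    printf("%s (%s)\n", $item["TITLE"], $item["LINK"]);
}
```

Returning data rather than printing it leaves the choice of output format to the calling page.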

For more information on the RSS 1.0 specification, visit the RSS Working Group’s Web site. This site also includes links to information about the older RSS 0.9x standards that are still in use by many sites on the Web today, as RSS 1.0 is still fairly new. You’ll be happy to know that the RSS 0.9x formats are just as easy to parse with PHP as RSS 1.0, if not more so.

To see the code involved in parsing RSS 0.91 documents, see Mark Robards’ article, Parsing XML With PHP. This short but informative article demonstrates a technique for parsing XML data with PHP that minimizes the number of global variables (or instance variables, if you use an Object Oriented approach) required. Personally I prefer readable code to clever tricks like these, but this technique could considerably simplify the parsing of very complex XML documents.

Frequently Asked Questions about PHP XML Parsing and RSS 1.0

How can I parse RSS feeds using PHP?

Besides the event-based parser covered in this article, you can parse RSS feeds with PHP’s SimpleXML extension. It converts XML into an object that can be processed with normal property selectors and array iterators. To parse an RSS feed, you first load the XML file using the simplexml_load_file() function. Then, you can loop through each item in the RSS feed and extract the data you need.
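A minimal sketch of the SimpleXML approach follows. It uses simplexml_load_string on an inline sample instead of simplexml_load_file so that it runs offline, and it selects elements explicitly by the RSS 1.0 namespace URI; the sample feed and variable names are assumptions for illustration.

```php
<?php
// Sketch only: a two-item RSS 1.0 sample held in a string.
$rss = <<<XML
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns="http://purl.org/rss/1.0/">
<item rdf:about="http://www.PromotionBase.com/article/551">
<title>Escape Search Engine Caching</title>
<link>http://www.PromotionBase.com/article/551</link>
</item>
<item rdf:about="http://www.WebmasterBase.com/article/541">
<title>Add JavaScript to Fireworks</title>
<link>http://www.WebmasterBase.com/article/541</link>
</item>
</rdf:RDF>
XML;

// RSS 1.0 elements live in this namespace
$ns = "http://purl.org/rss/1.0/";

$feed = simplexml_load_string($rss);
if ($feed === false) {
    throw new RuntimeException("Could not parse the feed.");
}

$headlines = [];
foreach ($feed->children($ns)->item as $item) {
    $fields = $item->children($ns);
    $headlines[] = [
        "title" => trim((string) $fields->title),
        "link"  => trim((string) $fields->link),
    ];
}

foreach ($headlines as $headline) {
    printf("%s (%s)\n", $headline["title"], $headline["link"]);
}
```

Unlike the event-based parser, this reads the whole document into memory before any item is available.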

What is the difference between RSS 1.0 and other versions of RSS?

RSS 1.0 is based on the Resource Description Framework (RDF), while other versions like RSS 2.0 are not. This means that RSS 1.0 can use RDF’s capabilities for metadata and for linking other modules. This makes RSS 1.0 more extensible and flexible compared to other versions.

How can I handle errors when parsing XML with PHP?

PHP provides several ways to handle errors when parsing XML. One way is to use the libxml_use_internal_errors() function to disable standard libxml errors and enable user error handling. Then, you can use the libxml_get_errors() function to get an array of errors.
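The following sketch applies that approach to a deliberately malformed fragment. The broken sample and the $messages array are assumptions for illustration; the exact wording of the messages comes from libxml and may vary between versions.

```php
<?php
// Sketch only: the <title> element is never closed.
$broken = "<channel><title>Unclosed title</channel>";

// Collect libxml errors internally instead of emitting PHP warnings
$previous = libxml_use_internal_errors(true);

$doc = simplexml_load_string($broken);

$messages = [];
if ($doc === false) {
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf("Line %d: %s", $error->line, trim($error->message));
    }
    // Empty the error buffer so later parses start clean
    libxml_clear_errors();
}

// Restore whatever error-handling mode was active before
libxml_use_internal_errors($previous);

echo implode("\n", $messages), "\n";
```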

How can I parse RSS feeds in JavaScript?

While this article focuses on parsing RSS feeds using PHP, it’s also possible to do this in JavaScript. You can use the fetch API to get the RSS feed, and then use a library like rss-parser to parse the feed.

Can I use PHP to parse other types of XML, not just RSS feeds?

Yes, PHP’s SimpleXML extension can be used to parse any well-formed XML document, not just RSS feeds. The same is true of the event-based parser functions described in this article.

How can I display the parsed RSS feed data on my website?

Once you’ve parsed the RSS feed data, you can display it on your website using PHP’s echo statement. You can format the data as needed using HTML and CSS.

Can I parse RSS feeds from multiple sources at once using PHP?

Yes, you can parse RSS feeds from multiple sources at once using PHP. You would need to load and parse each RSS feed separately, and then combine the results.

How can I filter the parsed RSS feed data?

You can filter the parsed RSS feed data by adding conditions in your PHP code. For example, you could only display items that have a certain keyword in their title or description.
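As a small sketch of such a condition, the function below keeps only the items whose title or description contains a keyword, ignoring case. The filterByKeyword name and the array layout of $items are hypothetical; the sample data is abridged from the feed shown earlier.

```php
<?php
// Hypothetical helper: $items is a list of ['title' => ..., 'description' => ...] arrays.
function filterByKeyword(array $items, $keyword)
{
    $matches = array_filter($items, function ($item) use ($keyword) {
        return stripos($item["title"], $keyword) !== false
            || stripos($item["description"], $keyword) !== false;
    });
    // Re-index so the result is a plain list
    return array_values($matches);
}

$items = [
    [
        "title" => "Escape Search Engine Caching",
        "description" => "Ralph shows how you can stop search engines caching your pages.",
    ],
    [
        "title" => "Add JavaScript to Fireworks",
        "description" => "Add interactivity to your site without learning JavaScript!",
    ],
    [
        "title" => "Better eCommerce Questionnaire",
        "description" => "Overhaul your ecommerce strategy now!",
    ],
];

$searchItems = filterByKeyword($items, "search");
$scriptItems = filterByKeyword($items, "javascript");

foreach ($searchItems as $item) {
    echo $item["title"], "\n";
}
```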

Can I use PHP to create my own RSS feed?

Yes, you can use PHP to create your own RSS feed. You would need to create an XML document in the correct format, and then add items to the feed using PHP.

How can I update the parsed RSS feed data automatically?

To update the parsed RSS feed data automatically, you could set up a cron job that runs your PHP script at regular intervals. This would ensure that your website always displays the latest data from the RSS feed.