PHP and XML: Parsing RSS 1.0

XML is springing up all over the Internet as a means to create standard data formats for the exchange of information between systems, irrespective of their platform or technology. As you may already know, XML allows you to define your own custom markup languages similar to HTML and suited to whatever data you need to represent. A number of standard XML-based markup languages have been created to facilitate the exchange of common types of information. In this article, we’ll learn how to use PHP to read an XML document and display the data it contains as a Web page. The example we’ll use is a Resource Description Framework (RDF) Site Summary (RSS) 1.0 document, although the techniques presented here apply to any situation where you wish to parse XML data in a PHP script.

A Brief Tour Of RSS 1.0

RSS originally stood for Rich Site Summary, a format developed by Netscape; it now refers to RDF Site Summary, an updated and XML-compliant version of the Netscape technology. It is an XML document format intended to describe, summarize, and distribute the contents of a Web site as a ‘channel’. Sites such as MoreOver.com and O’Reilly’s Meerkat process RSS feeds provided by news and other content sites and provide combined headline newsfeed services. RSS is currently developed by the RSS-DEV Working Group.

As with most XML document formats, the meaning of the document can be gleaned fairly easily simply by looking over a sample document. SitePoint.com provides summaries of its front-page articles in RSS format at http://www.sitepoint.com/rss.php. If you are using Internet Explorer 5 or later, you can view the current version of this XML document directly in your browser. For everyone else, here is the current SitePoint.com RSS file at the time of this writing:

<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns="http://purl.org/rss/1.0/">

<channel rdf:about="http://www.sitepoint.com/rss.php">
<title>SitePoint.com</title>
<description>Master the Web!</description>
<link>http://www.sitepoint.com/</link>

<items>
<rdf:Seq>
<rdf:li rdf:resource="http://www.PromotionBase.com/article/551"/>
<rdf:li rdf:resource="http://www.WebmasterBase.com/article/541"/>
<rdf:li rdf:resource="http://www.eCommerceBase.com/article/552"/>
<rdf:li rdf:resource="http://www.eCommerceBase.com/article/505"/>
<rdf:li rdf:resource="http://www.PromotionBase.com/article/556"/>
<rdf:li rdf:resource="http://www.eCommerceBase.com/article/508"/>
</rdf:Seq>
</items>
</channel>

<item rdf:about="http://www.PromotionBase.com/article/551">
<title>Escape Search Engine Caching</title>
<description>Did you know that many search engines cache your pages?
While this practice can speed up a search, users might not see your
most recent site updates! Ralph shows how you can stop search engines
caching your pages.</description>
<link>http://www.PromotionBase.com/article/551</link>
</item>

<item rdf:about="http://www.WebmasterBase.com/article/541">
<title>Add JavaScript to Fireworks</title>
<description>Does your design need more pizazz? Add interactivity to
your site without learning JavaScript! Matt explains the creation of
JavaScript effects in Fireworks, and explores in detail the use of
this program’s tools.</description>
<link>http://www.WebmasterBase.com/article/541</link>
</item>

<item rdf:about="http://www.eCommerceBase.com/article/552">
<title>eMail Campaigns in 8 Steps – Part 2</title>
<description>Ok, so you’ve reeled in your prospects and they’re on
your mailing list. Now what? How do you communicate effectively, and
turn them into customers? Jason reveals all…</description>
<link>http://www.eCommerceBase.com/article/552</link>
</item>

<item rdf:about="http://www.eCommerceBase.com/article/505">
<title>The Need for a Written Website Contract</title>
<description>A written agreement is essential if you pay others to
design, build or maintain your Websites. Ivan explains the necessity
of contracts to those who work on the Web.</description>
<link>http://www.eCommerceBase.com/article/505</link>
</item>

<item rdf:about="http://www.PromotionBase.com/article/556">
<title>Search Engine Strategies 2001 – Conference Report</title>
<description>Sinewave Interactive’s Gavin Appel talks to Matt about
this year’s Search Engine Strategies conference. He outlines the
discussions and predictions of industry leaders.</description>
<link>http://www.PromotionBase.com/article/556</link>
</item>

<item rdf:about="http://www.eCommerceBase.com/article/508">
<title>Better eCommerce Questionnaire</title>
<description>Overhaul your ecommerce strategy now! Face up to the
tough questions with Lee, as he guides you through a simple process
to optimize your ecommerce strategy.</description>
<link>http://www.eCommerceBase.com/article/508</link>
</item>

</rdf:RDF>

As you can see, the file begins with a <channel> tag that contains the title, description, and URL of the site that the RSS file describes, as well as a list of the <items> that the channel currently contains. This tag is followed by an <item> tag for each of the articles that appear on the front page of SitePoint.com. For each, the title, description, and URL are provided. Note that this is a bare-bones RSS file: many sites use standard extensions to the RSS format to include things like author names, images, and publication dates for the items in their channel, but for the purposes of this article this basic RSS file will do.

Most Web browsers can’t read XML pages, and those that can display only the code of the page (Internet Explorer 5+) or its textual portions (Netscape 6+) by default. If you want to display this RSS document to users, you therefore need some intermediate technology to convert it into something presentable. Other possibilities include reading the file and storing the headlines in a database, or emailing subscribed users if particular keywords appear in the descriptions of new articles. In any case, you’re going to need something that can read XML. Of the many options available in this arena, this article will examine the use of PHP to parse an XML document.

PHP’s Take on XML

There are two widely-used methods for programming languages to read XML documents: event-based APIs and Document Object Model (DOM) APIs. In the latter class of APIs, XML documents are read into memory in their entirety and can then be manipulated through a set of functions that provide access to an object oriented model of the document (the DOM) in memory. DOM APIs are generally considered to be more powerful; however, they suffer from one serious drawback: they are ill-suited to processing large XML documents, which would take too much memory to build the model of the document.

PHP’s XML parser functions use an event-based API to process XML. In such models, the XML document is read in from beginning to end, setting off an event whenever a start tag, end tag, or block of character data is encountered. Each of these events causes a function of the programmer’s choice to be called. Thus, reading an XML document with an event-based API like that of PHP is simply a matter of writing the functions to react appropriately to the events that occur as PHP moves through the document.
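To make the idea of events concrete, here is a minimal sketch that simply records each event as it fires. The one-line document and the $events array are made up for illustration, and the handlers are written as anonymous functions to keep the example short; the rest of this article uses named functions instead.

```php
<?php
// Illustrative only: the tiny document and the $events log are assumptions
// made for this sketch, not part of the SitePoint feed.
$events = [];

$parser = xml_parser_create();

// Record every opening and closing tag
xml_set_element_handler(
    $parser,
    function ($parser, $name, $attrs) use (&$events) {
        $events[] = "start $name";
    },
    function ($parser, $name) use (&$events) {
        $events[] = "end $name";
    }
);

// Record character data, skipping whitespace-only text
xml_set_character_data_handler(
    $parser,
    function ($parser, $data) use (&$events) {
        if (trim($data) !== '') {
            $events[] = "text " . trim($data);
        }
    }
);

xml_parse($parser, '<item><title>Hello</title></item>', true);

// The tag events arrive in document order: the ITEM start tag first,
// then TITLE, the text inside it, and finally the two end tags.
print_r($events);
```

Running this prints the log of events in the order the parser met them, which is exactly the order in which your own handler functions would be called.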

Here’s the basic code for setting up event-handling functions and parsing (reading in) an XML document in PHP:

// Create an XML parser
$xml_parser = xml_parser_create();

// Set the functions to handle opening and closing tags
xml_set_element_handler($xml_parser, "startElement", "endElement");

// Set the function to handle blocks of character data
xml_set_character_data_handler($xml_parser, "characterData");

// Open the XML file for reading
$fp = fopen("http://www.sitepoint.com/rss.php", "r")
    or die("Error reading RSS data.");

// Read the XML file 4KB at a time
while ($data = fread($fp, 4096)) {
    // Parse each 4KB chunk with the XML parser created above
    xml_parse($xml_parser, $data, feof($fp))
        // Handle errors in parsing
        or die(sprintf("XML error: %s at line %d",
            xml_error_string(xml_get_error_code($xml_parser)),
            xml_get_current_line_number($xml_parser)));
}

// Close the XML file
fclose($fp);

// Free up memory used by the XML parser
xml_parser_free($xml_parser);

Each line of the above code is commented to explain what it does, but let’s look at the XML-related PHP functions used in it in a little more detail:

  • xml_parser_create() Creates an XML parser. Just as you must create a database connection in PHP if you want to interact with a database, you must create an XML parser to use when you want to read in an XML file. In the above example, a reference to the parser is stored in $xml_parser.
  • xml_set_element_handler(parser, startElementFunction, endElementFunction) This function specifies the functions that an XML parser should use to process the events generated by opening and closing tags. In this case, the parser is the one stored in our $xml_parser variable, while the functions are called startElement and endElement. These functions will be defined elsewhere in the PHP script (I’ll give an example below).
  • xml_set_character_data_handler(parser, characterDataFunction) This function specifies the function that the XML parser should use to process character data appearing between tags in an XML document. Once again we use our $xml_parser variable in the example above. The function we choose to process character data is called characterData.
  • xml_parse(parser, data, endOfDocument) This function sends all or part of an XML document to the parser for it to process. The endOfDocument parameter should be set to true if the data marks the end of the XML document, or false if more of the document will follow in a subsequent call to xml_parse. This allows the parser to correctly catch unclosed tags at the end of the document and so forth. In our example, the parser is once again $xml_parser. The $data variable (up to 4KB in size) retrieved from the file with fread is passed as the data to be processed, while the feof function is used to determine whether PHP has reached the end of the XML file or not, thus providing the required endOfDocument parameter. If an error occurs in the parsing of the document, we print out the error message and the line of the file at which it occurs with xml_error_string, xml_get_error_code and xml_get_current_line_number, all of which are described in detail in the PHP manual if you’re curious.
  • xml_parser_free(parser) Although all memory resources are freed at the end of a PHP script, you may wish to free up the memory used by the XML parser if your script will perform other potentially memory-intensive tasks after it parses the XML data. This function destroys the specified XML parser, thus freeing up resources and memory it may have allocated for parsing.
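To see what the endOfDocument parameter buys you, here is a small sketch that feeds a string to the parser in pieces, the same way our loop feeds it 4KB blocks from a file. The parseInChunks function, its 16-byte chunk size, and the sample documents are all assumptions made for this illustration.

```php
<?php
// Hypothetical helper for this sketch: feeds a string to the parser in small
// chunks, much as the article's loop feeds it 4KB blocks read from a file.
function parseInChunks(string $xml, int $chunkSize = 16): string
{
    $parser = xml_parser_create();
    $chunks = str_split($xml, $chunkSize);
    $last = count($chunks) - 1;

    foreach ($chunks as $i => $chunk) {
        // The third argument is true only for the final chunk
        if (!xml_parse($parser, $chunk, $i === $last)) {
            return sprintf(
                "XML error: %s at line %d",
                xml_error_string(xml_get_error_code($parser)),
                xml_get_current_line_number($parser)
            );
        }
    }
    return "OK";
}

// A complete document parses cleanly
echo parseInChunks('<channel><title>SitePoint.com</title></channel>'), "\n";

// The closing </channel> is missing, so the parser reports an error
// (the exact wording of the message depends on the underlying XML library)
echo parseInChunks('<channel><title>SitePoint.com</title>'), "\n";
```

Because the parser is told when the last chunk arrives, it can tell the difference between a document that simply hasn’t finished arriving yet and one that is genuinely missing a closing tag.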

There are a few additional functions that let you handle some of the more esoteric events that occur during XML parsing, but these are well documented in the PHP manual so I’ll leave them for you to read up on if your particular application requires them. For our purposes (reading an RSS file), we now have everything we need. All that’s left is to write the three event handling functions: startElement, endElement, and characterData.

Event Handlers for RSS Parsing

We have three functions to write. Each of these functions must take certain parameters. These parameters are dictated by PHP, since it is PHP’s XML parser that will call them. Here are the parameters that you must define for each of these functions:

startElement($parser, $tagName, $attrs)

  • $parser will be passed a reference to the XML parser that is being used to parse the document.
  • $tagName is the ALL-UPPERCASE (the PHP manual calls this ‘case-folded’) version of the name of the opening tag that triggered the event.
  • $attrs is an associative array of the attributes that are present in the tag that triggered the event. For example, if the tag <body bgcolor="#FFFFFF"> triggered the event, then the value of $attrs['BGCOLOR'] would be "#FFFFFF". Note that, like the tag name, attribute names are case-folded (all uppercase).

endElement($parser, $tagName)

  • $parser will be passed a reference to the XML parser that is being used to parse the document.
  • $tagName is the case-folded name of the closing tag that triggered the event.

characterData($parser, $data)

  • $parser will be passed a reference to the XML parser that is being used to parse the document.
  • $data is a string of text appearing between XML tags in the document. The text between two tags will not necessarily trigger a single event: a block of text spread over multiple lines may be delivered as several events (for example, one per line), with each event being passed its own piece of the text as $data.
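If case folding is hard to picture, the following sketch shows what a start-tag handler actually receives for the <body> example above. The $seen array and the two parser variables are made up for this illustration; the second parser also shows that folding can be switched off with xml_parser_set_option, which the scripts in this article do not do.

```php
<?php
// Illustrative sketch: collects the arguments passed to the start-tag handler.
$seen = [];
$collect = function ($parser, $tagName, $attrs) use (&$seen) {
    $seen[] = [$tagName, $attrs];
};
$ignoreEnd = function ($parser, $tagName) {
    // Closing tags are not interesting here
};

$xml = '<body bgcolor="#FFFFFF"></body>';

// Default behaviour: tag and attribute names are case-folded to uppercase
$folded = xml_parser_create();
xml_set_element_handler($folded, $collect, $ignoreEnd);
xml_parse($folded, $xml, true);

// With case folding switched off, names arrive as written in the document
$asWritten = xml_parser_create();
xml_parser_set_option($asWritten, XML_OPTION_CASE_FOLDING, 0);
xml_set_element_handler($asWritten, $collect, $ignoreEnd);
xml_parse($asWritten, $xml, true);

// $seen[0] holds 'BODY' with the key 'BGCOLOR';
// $seen[1] holds 'body' with the key 'bgcolor'
print_r($seen);
```

Whichever setting you choose, the comparisons in your handlers have to match it, which is why the code that follows compares tag names against uppercase strings.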

With this in mind, the process of converting the XML data for SitePoint’s RSS file into a viewable HTML document may seem fairly straightforward at first glance. If you stop and try to work out what the three event handling functions should do, however, you’ll quickly realise that it’s not quite as simple as it seems. For those of you who may be feeling lost at this stage, don’t worry. Looking at definitions for these functions that will process SitePoint’s RSS file (or indeed any site’s RSS file) should help it all make sense.

The first complexity that may strike you is that the characterData function must react to text appearing between tags, but nothing is passed to the function to tell it which tags contain the text being processed. For this reason, most XML parsing scripts will need to define a set of global variables to track information received by one of the event-handling functions for use by the others.

In the case of our RSS file, all the information we need about the articles on SitePoint’s cover page is contained in the <item> tags in the document. So the first global variable we’ll define will be $insideitem, which we’ll set to true when entering an <item> tag and false when exiting one. We’ll also define four other variables, the purposes for which will become clear as we move along:

$insideitem = false;
$tag = "";
$title = "";
$description = "";
$link = "";

Let’s begin with startElement. This function will be called by the XML parser whenever an opening tag is encountered. Since we’re only really interested in what goes on between <item> tags, we’ll first check if we are indeed inside an <item> tag:

function startElement($parser, $tagName, $attrs) {
    global $insideitem, $tag;
    if ($insideitem) {

Note the global statement at the start of the function, which indicates that this function will need access to the $insideitem and $tag global variables. Now, if $insideitem is true, it means we’re going to want to take note of the tag that is starting so we know what to do with the character data it contains, which will trigger a call to characterData next. So we record the name of the tag ($tagName) in our global $tag variable:

        $tag = $tagName;

If, on the other hand, we’re not inside an <item> tag, then the only opening tag that we could possibly be interested in would be an actual <item> tag, in which case we would set $insideitem to true to indicate that we were entering one of these tags:

    } elseif ($tagName == "ITEM") {
        $insideitem = true;
    }
}

Note that we are checking if $tagName is "ITEM", since tag names are case-folded to all uppercase.

That does it for opening tags. The next step in parsing our RSS document is handling the character data that appears between tags, and that’s the job of our characterData function:

function characterData($parser, $data) {
    global $insideitem, $tag, $title, $description, $link;

This function requires access to all five of our global variables, as we’ll see shortly. Now, as before, the only time we are interested in the character data in the XML file is when we are inside an <item> tag, so the first step again is to check if that is the case:

    if ($insideitem) {

There are three tags inside <item> that we are interested in: <title>, <description> and <link>. Since we want to display the title of each article above its description and with a link to the URL specified in the <link> tag, we can’t simply output the character data as it is encountered in the XML file. Instead, we need to collect all the data for each <item> tag and then print it all out at once. Our global $title, $description and $link variables will be used for this exact purpose. We will use a switch statement to determine which tag we are dealing with and store the $data in the corresponding variable. Recall that the name of the current tag is stored in the global $tag variable.

        switch ($tag) {
            case "TITLE":
                $title .= $data;
                break;
            case "DESCRIPTION":
                $description .= $data;
                break;
            case "LINK":
                $link .= $data;
                break;
        }
    }
}

Note that we append (.=) the $data to the variable in question, rather than simply assigning it (=) because the contents of a single tag can be received as several consecutive characterData events.
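Here is a quick sketch of why appending matters. It records every piece of character data the parser hands over for a single <title> tag; the sample text and the $pieces array are invented for this illustration. How many pieces arrive depends on the underlying XML library, so the sketch relies only on what you get when you join them back together.

```php
<?php
// Illustrative sketch: log each piece of character data separately.
$pieces = [];

$parser = xml_parser_create();
xml_set_character_data_handler(
    $parser,
    function ($parser, $data) use (&$pieces) {
        $pieces[] = $data;
    }
);

// The text contains an entity and a line break, both of which can cause
// the parser to deliver the text in more than one event
xml_parse($parser, "<title>Tips &amp; Tricks\nfor PHP</title>", true);

// Joining the pieces restores the full text, which is what .= does for us
$title = implode('', $pieces);
echo count($pieces), " piece(s): ", $title, "\n";
```

Had the handler used plain assignment, only the last piece would have survived whenever the text arrived in more than one event.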

Once the character data for a tag has been processed, the next event to occur will call our endElement function to indicate the closing tag. In this application, the only tag that will require action on our part following its closing is the <item> tag. When the </item> tag is encountered, we will have retrieved all the $title, $description and $link data for the item, and so we can then output it as HTML:

function endElement($parser, $tagName) {
    global $insideitem, $tag, $title, $description, $link;
    if ($tagName == "ITEM") {
        // Escape the URL as well, since it is written into an HTML attribute
        printf("<p><b><a href='%s'>%s</a></b></p>",
            htmlspecialchars(trim($link), ENT_QUOTES),
            htmlspecialchars(trim($title)));
        printf("<p>%s</p>", htmlspecialchars(trim($description)));

Feel free to use echo statements if you’re not used to the more convenient printf function I used above. In either case, once you’ve output the URL, title and description of the <item>, you can clear the global variables so that they’re ready to receive the character data for the next <item> in the document:

        $title = "";
        $description = "";
        $link = "";

And then finally set $insideitem to false to indicate to our other functions that we are no longer inside an <item> tag.

        $insideitem = false;
    }
}

That’s it! To see this script in action, click here. You can also see the complete source code (use the view source command in your browser if the source code isn’t displayed as a text file).

Another OOPtion

The more experienced programmers in the audience may have cringed as soon as I mentioned the use of global variables. There is a school of thought that global variables are merely a sign of lazy programming, and indeed PHP’s Object Oriented Programming (OOP) features provide a better option. Here’s an alternate version of the script we have just developed. Instead of functions to handle the XML document events, we use the methods of a PHP object. The data that these methods must share can then be stored as instance variables of the object, thus eliminating the need for global variables in our script.

class RSSParser {

    var $insideitem = false;
    var $tag = "";
    var $title = "";
    var $description = "";
    var $link = "";

    function startElement($parser, $tagName, $attrs) {
        if ($this->insideitem) {
            $this->tag = $tagName;
        } elseif ($tagName == "ITEM") {
            $this->insideitem = true;
        }
    }

    function endElement($parser, $tagName) {
        if ($tagName == "ITEM") {
            // Escape the URL as well, since it is written into an HTML attribute
            printf("<p><b><a href='%s'>%s</a></b></p>",
                htmlspecialchars(trim($this->link), ENT_QUOTES),
                htmlspecialchars(trim($this->title)));
            printf("<p>%s</p>",
                htmlspecialchars(trim($this->description)));
            $this->title = "";
            $this->description = "";
            $this->link = "";
            $this->insideitem = false;
        }
    }

    function characterData($parser, $data) {
        if ($this->insideitem) {
            switch ($this->tag) {
                case "TITLE":
                    $this->title .= $data;
                    break;
                case "DESCRIPTION":
                    $this->description .= $data;
                    break;
                case "LINK":
                    $this->link .= $data;
                    break;
            }
        }
    }
}

$xml_parser = xml_parser_create();
$rss_parser = new RSSParser();
xml_set_object($xml_parser, $rss_parser);
xml_set_element_handler($xml_parser, "startElement", "endElement");
xml_set_character_data_handler($xml_parser, "characterData");
$fp = fopen("http://www.sitepoint.com/rss.php", "r")
    or die("Error reading RSS data.");
while ($data = fread($fp, 4096)) {
    xml_parse($xml_parser, $data, feof($fp))
        or die(sprintf("XML error: %s at line %d",
            xml_error_string(xml_get_error_code($xml_parser)),
            xml_get_current_line_number($xml_parser)));
}
fclose($fp);
xml_parser_free($xml_parser);

If you’re familiar with object oriented programming in PHP, then the only line that may be a source of consternation for you is the following:

xml_set_object($xml_parser, $rss_parser);

Since xml_set_element_handler and xml_set_character_data_handler cannot take a string that refers to a method of an object as the name of an event handler (i.e. xml_set_element_handler($xml_parser,"$rss_parser->startElement",... won’t work), you need a way to tell the XML parser to call methods of the $rss_parser object instead of basic functions. xml_set_object does just that, taking the parser as well as the object whose methods you want it to call. Scripts written for PHP 4 passed the object by reference at the call (&$rss_parser); that syntax is a fatal error as of PHP 5.4, and since PHP 5 passes objects by handle anyway, the & is not needed.
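As an alternative sketch, the handler-setting functions also accept an array holding an object and a method name in place of a plain function name, which removes the need for xml_set_object altogether. As far as I’m aware, recent PHP releases (8.4 onward) deprecate xml_set_object in favour of this style, so check the manual for the version you run. The TitleCollector class below is a cut-down stand-in invented for this example, not the RSSParser class from the script above.

```php
<?php
// Hypothetical minimal handler class for this sketch: it only gathers the
// text of <title> tags.
class TitleCollector
{
    public $titles = [];
    private $insideTitle = false;
    private $buffer = '';

    public function startElement($parser, $tagName, $attrs)
    {
        if ($tagName === 'TITLE') {
            $this->insideTitle = true;
            $this->buffer = '';
        }
    }

    public function endElement($parser, $tagName)
    {
        if ($tagName === 'TITLE') {
            $this->titles[] = trim($this->buffer);
            $this->insideTitle = false;
        }
    }

    public function characterData($parser, $data)
    {
        if ($this->insideTitle) {
            $this->buffer .= $data;
        }
    }
}

$collector = new TitleCollector();
$xml_parser = xml_parser_create();

// Each handler is given as [object, method name] rather than a function name
xml_set_element_handler(
    $xml_parser,
    [$collector, 'startElement'],
    [$collector, 'endElement']
);
xml_set_character_data_handler($xml_parser, [$collector, 'characterData']);

xml_parse(
    $xml_parser,
    '<item><title>Add JavaScript to Fireworks</title></item>',
    true
);

print_r($collector->titles);
```

The handler methods are declared public here because it is the XML parser, not the object itself, that calls them.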

Download the full code of this Object Oriented version here.

Summary and Resources for Further Reading

In this article we learned how to parse XML documents (taking the specific example of an RSS 1.0 document) using PHP. Although its event-based API for XML document processing can require some thought to determine how to process even a simple XML document structure, we saw that thinking of the events as they occur in the processing of the document can help to determine what each of the event handling methods needs to do.

In a fully-coded example, I demonstrated how to parse SitePoint’s RSS file, which lists the cover articles on SitePoint.com at any given time. Feel free to adapt this code for use on your site to list the current headlines on SitePoint.com in whatever format suits your site! For the programming purists out there, I provided an alternate version of the code that avoids the use of global variables by encapsulating the event handlers and the variables they must share inside a PHP class. With a little more work, you could encapsulate all of the XML processing code within the class and place that class in an include file, thus reducing the actual code in your document for displaying the SitePoint headlines to a two-line affair of creating your RSSParser object and then passing one of its methods the URL of SitePoint’s RSS file (http://www.sitepoint.com/rss.php).
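Here is one way that fully encapsulated class might look. Treat it as a rough sketch under a few assumptions rather than a finished implementation: the RssHeadlines name, its parseFile and parseString methods, and the decision to return the items as an array instead of printing them are all my own choices for this illustration.

```php
<?php
// Hypothetical class for this sketch: the class and method names are
// illustrative and are not part of the article's downloadable code.
class RssHeadlines
{
    private $insideItem = false;
    private $tag = '';
    private $item = [];
    private $items = [];

    // Read a file or URL in 4KB chunks and return the items it contains
    public function parseFile(string $location): array
    {
        $fp = fopen($location, 'r');
        if ($fp === false) {
            throw new RuntimeException('Error reading RSS data.');
        }
        $parser = $this->createParser();
        try {
            while (($data = fread($fp, 4096)) !== false && $data !== '') {
                $this->feed($parser, $data, feof($fp));
            }
        } finally {
            fclose($fp);
        }
        return $this->items;
    }

    // Parse a complete document that is already held in a string
    public function parseString(string $xml): array
    {
        $this->feed($this->createParser(), $xml, true);
        return $this->items;
    }

    // Reset the shared state and wire this object's methods up as handlers
    private function createParser()
    {
        $this->insideItem = false;
        $this->tag = '';
        $this->items = [];
        $parser = xml_parser_create();
        xml_set_element_handler($parser, [$this, 'startElement'], [$this, 'endElement']);
        xml_set_character_data_handler($parser, [$this, 'characterData']);
        return $parser;
    }

    private function feed($parser, string $data, bool $isFinal): void
    {
        if (!xml_parse($parser, $data, $isFinal)) {
            throw new RuntimeException(sprintf(
                'XML error: %s at line %d',
                xml_error_string(xml_get_error_code($parser)),
                xml_get_current_line_number($parser)
            ));
        }
    }

    public function startElement($parser, $tagName, $attrs)
    {
        if ($this->insideItem) {
            $this->tag = $tagName;
        } elseif ($tagName === 'ITEM') {
            $this->insideItem = true;
            $this->item = ['TITLE' => '', 'DESCRIPTION' => '', 'LINK' => ''];
        }
    }

    public function characterData($parser, $data)
    {
        // Only collect text for the three tags we care about
        if ($this->insideItem && isset($this->item[$this->tag])) {
            $this->item[$this->tag] .= $data;
        }
    }

    public function endElement($parser, $tagName)
    {
        if ($tagName === 'ITEM') {
            $this->items[] = array_map('trim', $this->item);
            $this->insideItem = false;
            $this->tag = '';
        }
    }
}

// Usage with an inline sample document (example.com URLs are placeholders)
$rss = new RssHeadlines();
$items = $rss->parseString(
    '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns="http://purl.org/rss/1.0/">'
    . '<item rdf:about="http://www.example.com/1"><title>First</title>'
    . '<description>One</description><link>http://www.example.com/1</link></item>'
    . '</rdf:RDF>'
);
print_r($items);
```

With the class in an include file, fetching the live headlines would come down to creating the object and calling parseFile with the URL of the RSS file, leaving the page free to format the returned items however it likes.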

For more information on the RSS 1.0 specification, visit the RSS Working Group’s Web site. This site also includes links to information about the older RSS 0.9x standards that are still in use by many sites on the Web today, as RSS 1.0 is still fairly new. You’ll be happy to know that the RSS 0.9x formats are just as easy to parse with PHP as RSS 1.0, if not more so.

To see the code involved in parsing RSS 0.91 documents, see Mark Robards’ article, Parsing XML With PHP. This short but informative article demonstrates a technique for parsing XML data with PHP that minimizes the number of global variables (or instance variables, if you use an Object Oriented approach) required. Personally I prefer readable code to clever tricks like these, but this technique could considerably simplify the parsing of very complex XML documents.
