Get XSL To Do Your Dirty Work

Writing content management systems (CMSs) for a living is a messy business, especially if you rely exclusively on server-side scripting languages like PHP. No matter how well you write the code for a CMS, no matter how much object-oriented modularity you throw into it, you’re still going to have to get up to your elbows in troublesome, unreliable code when it comes time to format your content for display.

If this sounds familiar to you, if you find yourself messing with tiresome code based on complex regular expressions every time you need to tweak the formatting of your site’s articles, it might just be time to take a look at XSL.

Reinventing the Wheel

If you’re anything like me, you’ve worked on a number of content-driven sites and have come up with a pretty standard formula for the design of the content management systems they rely on:

You create a simple, custom set of tags for users to format their articles (tutorials, FAQs, reviews, or what have you) with.

You store the text of the articles, peppered with these custom tags, in a database.

When a visitor to the site views one of the articles, you have a big mess of code that translates that tagged text into a neatly-formatted HTML page for them to view.

This structure is illustrated in Fig. 1 for an average PHP/MySQL-based site:

Fig. 1: A Typical Content-Based Site Design

Now, systems like this certainly work, and usually work very well. So what’s the problem? Depending on what sort of developer you are, you’re likely to run into one of two problems with this approach:

Lack of robustness
Complex code for a relatively simple task

Let’s look at an example. The [b]...[/b] tag illustrated in Fig. 1 is a fairly straightforward matter to convert to the equivalent HTML syntax for display. Here’s how it might be done in PHP:

$document = str_replace('[b]', '<b>', $document);
$document = str_replace('[/b]', '</b>', $document);

No brainer, right? But what if someone forgets to type the closing [/b] tag? What if he or she uses two [b] tags instead? Well, either you live with the invalid HTML output that will result from such mistakes, or you step up your code to the next level of complexity, by using a regular expression to detect only valid pairs of tags:

$document = preg_replace('/\[b\](.*?)\[\/b\]/s', '<b>$1</b>', $document);

Better, but this code still doesn’t point out coding mistakes like typing an invalid [v] tag when [b] was intended; it just ignores them. To catch mistakes like those would require even more complex code… and all this to process what is likely to be one of the simplest tags in your system! Imagine the nightmare involved in making sure that a [list] tag contained one or more [*] tags, and that [*] tags didn’t occur outside of [list] tags!
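To see these blind spots in action, here is a minimal sketch of the same pair-matching idea, written in Python purely so it can be run on its own. The helper name convert_bold and the sample strings are my own inventions for illustration, not part of any real CMS.

```python
import re

# Hypothetical helper: converts only complete [b]...[/b] pairs to <b>...</b>.
PAIR = re.compile(r"\[b\](.*?)\[/b\]", re.DOTALL)

def convert_bold(document: str) -> str:
    """Replace each well-formed [b]...[/b] pair; leave everything else alone."""
    return PAIR.sub(r"<b>\1</b>", document)

good = convert_bold("a [b]bold[/b] word")
unclosed = convert_bold("a [b]bold word")       # author forgot [/b]
typo = convert_bold("a [v]bold[/v] word")       # author typed [v] instead of [b]

print(good)      # a <b>bold</b> word
print(unclosed)  # a [b]bold word
print(typo)      # a [v]bold[/v] word
```

The last two inputs pass straight through without a word of complaint, so the author's mistake ends up on the published page as literal bracketed text.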

Most sites that follow the design pattern discussed above will settle on simple, custom tag processing code that lacks the robustness to enforce these types of constraints and prevent operator error.

So, what’s the alternative? Don’t reinvent the wheel! Not only will a system built with XSL do all the parsing and checking for you, but it was designed from the ground up to convert custom tag-based documents into HTML pages and other popular document formats with a minimum of fuss. Sound good? Read on!

XSL: The 2 Minute Tour

Extensible Stylesheet Language, or XSL for short, is a combination of three individual languages, all of which are endorsed by the World Wide Web Consortium (W3C):

XSL Transformations (XSLT) let you define a set of rules that take an XML document, carve it up, and spit out a document in another format. XSLT 1.0 was officially released in November 1999, and newer versions are under development.
XML Path Language (XPath) lets you point to tags, attributes, and other things inside of an XML document with paths similar to the file and directory names your computer uses to access files on your hard drive. XSLT uses XPath to pick out sections of an XML document to be used in the conversion to another document type. XPath 1.0 was standardized in November 1999 at the same time as XSLT, and has remained stable since then.
XSL Formatting Objects (XSL-FO) comprise the portion of the XSL standard that lets you format sections of a document created with XSLT by specifying attributes such as colors, font, and spacing when the document type that is to be created supports them (e.g. PDF). XSL-FO was only finalized on October 15th 2001, an event that heralded the official release of the complete XSL 1.0 specification.
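To get a feel for the directory-like paths that XPath uses, here is a rough sketch using Python's standard xml.etree.ElementTree module, which understands only a small subset of XPath 1.0. The miniature article below is made up for the demonstration; a real XSL processor supports far richer expressions.

```python
import xml.etree.ElementTree as ET

# A made-up, trimmed-down article used only for this demonstration.
ARTICLE = """<article>
  <title>A Sample Article</title>
  <section><title>Article Section 1</title></section>
  <section><title>Another Section</title></section>
</article>"""

root = ET.fromstring(ARTICLE)  # root is the <article> element

# "title" means: a <title> directly inside the current tag.
article_title = root.findtext("title")

# "section/title" means: a <title> inside a <section> inside the current tag.
section_titles = [t.text for t in root.findall("section/title")]

print(article_title)
print(section_titles)
```

Just as a relative file path depends on the directory you are in, each path here is interpreted relative to the tag you start from, which in this sketch is <article>.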

Confusing combinations of standards aside, XSL is a much more robust, and (despite appearances) much less complex way of processing custom-tagged documents for display as formatted HTML pages.

Fig. 2: XML to Formatted HTML Conversion with XSL

As shown in Fig. 2, you just feed an XML document along with the XSL stylesheet that contains the rules to convert it to HTML into an XSL processor (there are quite a few available). The XSL processor handles all the checking for missing or invalid tags and then performs the transformations specified in the XSL stylesheet. What’s produced is your fully formatted HTML document, ready to be sent to the user’s browser!

Microsoft Internet Explorer 6+ and Netscape 6+ both have standards-compliant XSLT processors built in (that is, they support XSL Transformations, but not the newly-introduced XSL-FO portion of the XSL standard), which makes it really easy to learn XSL on your own computer before you put it to work on your site. Let’s look at an example to demonstrate the basics of XSL.

Our Sample Document

Here’s a simple example of an article written as an XML document. By its very nature, XML lets you define your own tags, so your site’s document format can be as simple or as complex as you like. For this example, however, I’m using tags from the DocBook XML document format under the premise that it’s always better to use a standard when one is available. See the Further Reading section at the end of this article for more information on the DocBook format.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook V4.0//EN"
          "http://www.oasis-open.org/docbook/xml/4.0/docbookx.dtd">
<article>
  <title>A Sample Article</title>
  <section>
    <title>Article Section 1</title>
    <para>
    This is the first section of the article. Nothing terribly
    interesting here, though.
    </para>
  </section>
  <section>
    <title>Another Section</title>
    <para>
    Just so you can see how these things work, here's an
    itemized list:
    </para>
    <itemizedlist>
      <listitem>
        <para>The first item in the list</para>
      </listitem>
      <listitem>
        <para>The second item in the list</para>
      </listitem>
      <listitem>
        <para>The third item in the list</para>
      </listitem>
    </itemizedlist>
  </section>
</article>

This should look nice and simple, except for the first few lines. Let’s take a closer look at those:

<?xml version="1.0" encoding="UTF-8"?>

This line is actually optional. It identifies the rest of the file as an XML document, indicates the version of the XML standard that it obeys (1.0), and sets the document encoding (UTF-8). If this document is to be used in a system where all documents are XML, you can leave off this line without a problem.

<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook V4.0//EN"
          "http://www.oasis-open.org/docbook/xml/4.0/docbookx.dtd">

This rather gruesome-looking thing is the document type (DOCTYPE) declaration. It tells any program that reads this document (for our purposes, the XSL processor) where to find the Document Type Definition (DTD) that describes what tags are allowed in what structure for this document. In this example, we indicate that the document should obey the DocBook 4.0 standard, which is defined at the URL shown (check it out with your Web browser if you’re curious; to learn more about DTDs, pick up any good book on XML). If the rest of the document doesn’t obey the rules defined at that URL, a validating XSL processor will point out the error when it tries to process the document.

The DOCTYPE declaration is actually optional as well, but if you don’t include it the XSL processor will not check that your document is correctly structured with valid tags. Without a DTD, it will only check that the tags you use are all matched with closing tags and properly nested (e.g. <b><i>this is good</i></b>, but <b><i>this is not</b></i>).
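The well-formedness check described above is easy to try for yourself. This sketch uses the non-validating parser in Python's standard library as a stand-in for an XSL processor's parsing stage; the function name is_well_formed is mine, chosen for the example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(fragment: str) -> bool:
    """Return True if every tag is closed and properly nested."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        # Raised for unclosed, mismatched, or badly nested tags.
        return False

nested_ok = is_well_formed("<b><i>this is good</i></b>")
nested_bad = is_well_formed("<b><i>this is not</b></i>")

print(nested_ok)   # True
print(nested_bad)  # False
```

Note that this parser, like the one in the browser, checks nesting only. It has no opinion on whether <b> or <i> are legal tags in your document format; that is the DTD's job.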

If you decide to use a DTD to validate the tags used on your site, you’ll probably want to add the DOCTYPE declaration to documents automatically before they are processed, rather than forcing the user to type it at the top of each article. You could even add the <article> and </article> tags automatically as well, to minimize the number of tags that the user has to type.
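One way to do that wrapping might look like the sketch below. It is written in Python only to keep it self-contained; the function name wrap_article is hypothetical, and the DOCTYPE is simply copied from the sample document.

```python
import xml.etree.ElementTree as ET

# Header copied from the sample document; swap in your own DTD as needed.
XML_DECLARATION = '<?xml version="1.0" encoding="UTF-8"?>\n'
DOCTYPE = (
    '<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook V4.0//EN"\n'
    '          "http://www.oasis-open.org/docbook/xml/4.0/docbookx.dtd">\n'
)

def wrap_article(body: str) -> str:
    """Add the XML declaration, DOCTYPE and <article> tags around user-typed content."""
    return XML_DECLARATION + DOCTYPE + "<article>\n" + body.strip() + "\n</article>\n"

user_text = "<title>My Review</title>\n<section><title>Intro</title><para>Hello.</para></section>"
document = wrap_article(user_text)

# Sanity check: the wrapped result must at least be well-formed XML.
root = ET.fromstring(document.encode("utf-8"))
print(root.tag)                # article
print(root.findtext("title"))  # My Review
```

The author types only the title and sections, and the system supplies the rest before handing the document to the XSL processor.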

Save the above file as docbook.xml and then open it in MSIE 6 or above. You should see something like Fig. 3.

Fig. 3: An XML Document in MSIE 6

By default, MSIE displays XML documents in this attractive, collapsible code view (you can expand and collapse portions of the document by clicking the red + and – icons next to the opening tags). Believe it or not, this view is actually generated in Dynamic HTML by a built-in XSL stylesheet that is applied whenever an XML document doesn’t come with a stylesheet of its own.

Something else you might notice about Internet Explorer’s XML processing engine is that it ignores the rules for acceptable tags and document structure set out in the DTD. If you changed the <article> and </article> tags to <invalidtag> and </invalidtag> (or any other tag not present in the DocBook Standard), Internet Explorer would not complain. That’s because the XML processor in MSIE is said to be non-validating. More advanced, validating XML processors will validate the tags.

Internet Explorer does check that the XML document is well-formed, however. Try removing one of the </listitem> tags in the <itemizedlist>; you should get an error message as shown in Fig. 4.

Fig. 4: Internet Explorer spots an unclosed tag

With our XML document ready to go, the next step is to create an XSL stylesheet to format it for display.

Your First XSL Stylesheet

Like the documents they are created to format, XSL stylesheets are XML documents. You can therefore write your first XSL stylesheet in any text editor that you find convenient. Type the following in and save it as docbook.xsl:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes" encoding="utf-8"
    doctype-public="-//W3C//DTD XHTML 1.0 Strict//EN"
    doctype-system="http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd" />

  <!-- templates go here -->

</xsl:stylesheet>

This is the basic shell for any XSL stylesheet that outputs HTML documents. Once again, you’ll see that it starts with the optional (but advisable) <?xml ...?> tag that brands this as an XML file. You need not bother with a <!DOCTYPE> tag, since the XSL processor knows all about XSL files and which tags are and are not allowed without the help of a DTD.

The <xsl:stylesheet> tag should be the outer element of every XSL file. The version attribute indicates that we are using XSL version 1.0 syntax. The xmlns:xsl attribute sets up an XML namespace for all our XSL tags. Basically, this attribute says that all tags that start with xsl: (this is called a prefix) are related to the URL http://www.w3.org/1999/XSL/Transform. If you try going to that URL in your Web browser, you’ll see the message “This is the XSLT namespace.” This page doesn’t actually provide any information to the XSL processor, but XSL processors will only process tags associated with that URL, and it must match character for character (note that it begins with http, not https). This lets you use tags such as <stylesheet> and <output> in your own documents, which will be ignored by the XSL processor. You can use any prefix you like to associate the XSL tags in your document with that URL (e.g. if the attribute were xmlns:exesel="http://www.w3.org/1999/XSL/Transform", then all XSL tags would have to begin with the prefix exesel:), but xsl: is the de facto standard.
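If the idea that the prefix is just a stand-in for the URL seems abstract, this small sketch may help. It assumes the namespace URL published by the W3C, http://www.w3.org/1999/XSL/Transform, and uses Python's standard XML parser, which reports each tag as {namespace URL}name once prefixes have been resolved.

```python
import xml.etree.ElementTree as ET

XSLT_NS = "http://www.w3.org/1999/XSL/Transform"

# Three one-tag documents: two prefixes bound to the XSLT URL, and one bare tag.
with_xsl = f'<xsl:stylesheet version="1.0" xmlns:xsl="{XSLT_NS}"/>'
with_exesel = f'<exesel:stylesheet version="1.0" xmlns:exesel="{XSLT_NS}"/>'
plain = '<stylesheet version="1.0"/>'

tags = [ET.fromstring(doc).tag for doc in (with_xsl, with_exesel, plain)]
for tag in tags:
    print(tag)
```

The first two tags come out identical because they resolve to the same URL, while the unprefixed <stylesheet> belongs to no namespace at all, which is why an XSL processor leaves it alone.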

Inside <xsl:stylesheet> there is only one tag in our basic ‘shell’: <xsl:output ... />. This tells the XSL processor that this stylesheet will output an XHTML document, as opposed to, say, a text file. The attributes of this tag may look a little complicated, but really they’re just setting the values that will appear in the <?xml ...?> and <!DOCTYPE> tags in the XHTML document generated. The / on the end of this tag indicates that it is an empty tag, and so a closing tag is not needed.

As in HTML, comments in XML documents are created with <!-- ... --> tags; thus, the tag <!-- templates go here --> will be ignored by XSL processors.

Let’s take a look at what happens when we apply this simple stylesheet to the article we created in the previous section (docbook.xml). With most XSL processors, we would specify the document and stylesheet we want to process, and it would spit out the resulting HTML file. In XSL-aware browsers like Internet Explorer 6+ and Netscape 6+, however, you need to add a tag to your XML document to tell it which stylesheet to use when displaying the document. At the top of docbook.xml, just after the <!DOCTYPE> tag and just before <article>, add the following line:

<?xml-stylesheet href="docbook.xsl" type="text/xsl"?>

This is a processing instruction that tells browser-based XSL processors (and some standalone processors that support it) where to find an XSL stylesheet that is appropriate for this document. In this example, we have told it to use "docbook.xsl", located in the same directory as the current document. Save this change, make sure that the two files are in the same directory, then view docbook.xml in either IE6+ or NS6+. Fig. 5 demonstrates how it should look in MSIE 6.

Fig. 5: A DocBook with Minimal Style

As you can see, the default behavior of an XSL stylesheet is to go through the XML document a tag at a time and print out the text contained therein. To change this behavior and make our document readable, we need to add some rules to our stylesheet. In the language of XSL, these rules are called templates. Here’s an example of a template:

  <xsl:template match="/article">
  <html>
  <head>
  <title><xsl:value-of select="title"/></title>
  </head>
  <body>
  <h1><xsl:value-of select="title"/></h1>
  <xsl:apply-templates select="section"/>
  </body>
  </html>
  </xsl:template>

As you can see, this is a mix of XSL tags (identified by the xsl: prefix) and familiar HTML tags, all contained within an <xsl:template> tag. The majority of XSL templates work by matching tags that appear in the XML document to be processed. The tag(s) to match are specified in the match attribute of the <xsl:template> tag.

In this case, our template is set to match /article. This is an XPath expression (remember, XPath is the standard for pointing to tags in an XML document). The leading / indicates the ‘root’ of the XML document, so /article means that this template should match any <article> tag that appears in the root of the XML document. Since our DocBook document begins with an <article> tag, this template will match that tag.

So the XSL processor sees that there is a template that matches the <article> tag in the root of our document. Now what? The processor looks inside the <xsl:template> tag to see what to do about it. The template begins with three HTML tags: <html>, <head> and <title>. Since these are not XSL tags (they don’t begin with the xsl: prefix), the processor writes these tags straight to the output document.

The next tag is an XSL tag: <xsl:value-of select="title"/>. The <xsl:value-of> tag lets you pick out a tag with an XPath expression and output the text it contains (its value) at a particular point in the file. The tag to output the value of is specified with the select attribute. In this case, we have select="title". This says that we want to choose the <title> tag that is inside the current tag (the current tag is the tag that matched the template, namely <article>). Looking back at the sample document, you should find that the <article> tag contains a <title> tag with the article’s title in it (“A Sample Article”). So what we’ve just done is take that title and use it as the page title in the HTML document to be created!

Note that, since the <xsl:value-of> tag doesn’t contain any text or tags, we have made the closing </xsl:value-of> tag part of the opening tag by ending it with a slash (/). Without this shortcut, we would have had to type <xsl:value-of select="title"></xsl:value-of>.

After a few more HTML tags (</title>, </head>, <body>), we have another <xsl:value-of> tag surrounded by HTML <h1>...</h1> tags. This tag is identical to the one used for the title of the page, so once again it will print out the title of our document, but this time between <h1>...</h1> tags, so that it is displayed in big letters at the top of the page.

The next XSL tag in the document is <xsl:apply-templates select="section"/>. This powerful tag tells the XSL processor to take any and all <section> tags that appear within the current tag (<article>) and apply any matching templates to them. At this stage, we only have this one template in our XSL stylesheet, so the default behavior of outputting the contents of the tags and any subtags takes effect.

Once the two <section> tags are processed in this manner, the XSL processor returns here to finish this template by outputting the </body> and </html> tags. Having reached the end of the document (there are no more tags after the closing </article> tag), the XSL processor terminates.
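The default behavior that kicks in when no template matches can be approximated in a couple of lines. This is only a rough Python imitation of what an XSL processor does for tags and text, with a made-up <section> as input; the function name default_output is mine.

```python
import xml.etree.ElementTree as ET

# A made-up section, similar in shape to those in the sample article.
SECTION = """<section>
  <title>Article Section 1</title>
  <para>Nothing terribly interesting here.</para>
</section>"""

def default_output(node: ET.Element) -> str:
    """Imitate the default rule: descend through every tag, emitting only text."""
    # itertext() yields the text of the element and all its descendants
    # in document order, with the tags themselves left out.
    return "".join(node.itertext())

section_text = default_output(ET.fromstring(SECTION))
print(section_text)
```

All the tags vanish and only their text (plus the whitespace between them) survives, which is why the sections come out as unformatted runs of text until we give them templates of their own.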

Here’s the HTML document that is produced from our sample document by our XSL stylesheet:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
   <head>
      <title>A Sample Article</title>
   </head>
   <body>
      <h1>A Sample Article</h1>
      Article Section 1

      This is the first section of the article. Nothing terribly
      interesting here, though.

      Another Section

      Just so you can see how these things work, here's an
      itemized list:


      The first item in the list

      The second item in the list

      The third item in the list


   </body>
</html>

If you update your copy of docbook.xsl with the above template and then view docbook.xml in your browser again, you’ll see this HTML document displayed as in Fig. 6.

Fig. 6: A Slightly More Stylish DocBook

Let’s add a few more templates to the stylesheet:

  <xsl:template match="section">
    <xsl:apply-templates/>
    <hr/>
  </xsl:template>

This template matches <section> tags, and will be triggered by the <xsl:apply-templates select="section"/> tag in our previous template above. For each <section> in the <article>, it will apply templates (or default behavior) to any sub-tags and then output a <hr/> tag.

  <xsl:template match="section/title">
    <h2><xsl:apply-templates/></h2>
  </xsl:template>

This template matches <title> tags that occur inside <section> tags (i.e. it will not match the <title> tag at the top of the <article>), and outputs the contents of the tag (applying any applicable templates) between <h2>...</h2> tags.

These remaining three templates should be quite self-explanatory:

  <xsl:template match="para">
    <p><xsl:apply-templates/></p>
  </xsl:template>

  <xsl:template match="itemizedlist">
    <ul><xsl:apply-templates/></ul>
  </xsl:template>

  <xsl:template match="listitem">
    <li><xsl:apply-templates/></li>
  </xsl:template>
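If it helps to think about what the processor is doing with these rules, the following toy imitation mirrors the six templates in this article as a recursive Python function. To be clear, this is not how a real XSL processor is implemented, and every function name in it is hypothetical; it exists only to show the "match a tag, emit some HTML, then recurse into the children" rhythm that the templates describe.

```python
import xml.etree.ElementTree as ET
from typing import Optional
from xml.sax.saxutils import escape

# DocBook tag -> HTML tag that wraps its processed content.
WRAPPERS = {"para": "p", "itemizedlist": "ul", "listitem": "li"}

def apply_templates(node: ET.Element) -> str:
    """Process a tag's text and child tags in document order."""
    parts = [escape(node.text or "")]
    for child in node:
        parts.append(transform(child, parent=node))
        parts.append(escape(child.tail or ""))
    return "".join(parts)

def transform(node: ET.Element, parent: Optional[ET.Element] = None) -> str:
    """Pick the 'template' that matches this tag and produce its output."""
    inner = apply_templates(node)
    if node.tag == "section":
        return inner + "<hr/>"
    if node.tag == "title" and parent is not None and parent.tag == "section":
        return f"<h2>{inner}</h2>"
    if node.tag in WRAPPERS:
        html_tag = WRAPPERS[node.tag]
        return f"<{html_tag}>{inner}</{html_tag}>"
    return inner  # no template matched: fall back to emitting the content

def transform_article(article: ET.Element) -> str:
    """Imitate the /article template: page shell, title, then each section."""
    title = escape(article.findtext("title", default=""))
    body = "".join(transform(s, parent=article) for s in article.findall("section"))
    return (f"<html><head><title>{title}</title></head>"
            f"<body><h1>{title}</h1>{body}</body></html>")

SOURCE = ("<article><title>T</title><section><title>S</title><para>P</para>"
          "<itemizedlist><listitem><para>I</para></listitem></itemizedlist>"
          "</section></article>")
html = transform_article(ET.fromstring(SOURCE))
print(html)
```

Compare the length of this with the handful of short templates above: the stylesheet says the same thing declaratively, and the parsing, recursion and fallback behavior all come for free.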

Download the completed docbook.xsl file if you have any doubts about how this should all fit together (the templates take the place of the <!-- templates go here --> comment), then view the docbook.xml file one more time in your favorite browser, this time with the full complement of templates. It should display as shown in Fig. 7.

Fig. 7: The Fully Styled Article

If you’d like to see an even more stylish version of your document, the DocBook Open Repository contains an official XSL stylesheet distribution that aims to support and format all of the tags defined in the DocBook standard. Download the latest stable version and see what the article looks like with that stylesheet applied.

Putting It All Together

Although this article has only scratched the surface of what XSL is capable of, I hope at least that I’ve convinced you that it’s easier to write an XSL stylesheet to convert your custom-tagged document into a viewable HTML page than it is to write the equivalent script in, say, ASP, PHP or Perl.

Eventually all browsers in common use will have built-in support for XSL the way Internet Explorer 6+ and Netscape 6+ do, at which point we can just send the browser an XML file with the XSL stylesheet to display it. For the time being, however, practical applications require the XSL processor to be integrated into the server, as shown in Fig. 8.

Fig. 8: An XML-Based CMS

If this looks like it might be more complicated to set up than a traditional CMS, where the document formatting is done with custom code, it is. The payoff for this extra setup time more than makes up for it, however:

The code is much simpler and more reusable.
XSL Templates are easier to maintain.
The tag parser is written for you, and is much more robust.
XSL parsers give better performance than custom code.

In future articles, we’ll delve more deeply into XSL to create more powerful templates and learn how to set up PHP and other server-side languages to use an XSL processor. For now, though, I’ll leave you with this list of related resources, which should give you plenty to think about before you design your next CMS.