PHP DOM: Working with XML

SimpleXML allows you to quickly and easily work with XML documents, and in the majority of cases SimpleXML is sufficient. But if you’re working with XML in any serious capacity, you’ll eventually need a feature that isn’t supported by SimpleXML, and that’s where the PHP DOM (Document Object Model) comes in.

PHP DOM is an implementation of the W3C DOM standard and it adheres more closely to the object model than SimpleXML does. It may seem a little overwhelming at first, but if you’re willing to learn then you’ll find that this library for accessing and manipulating XML documents provides a great deal of control over working with XML documents in PHP. This is because DOM differentiates between the various constituents of an XML document, such as different node types.

To explore some of the basic functionality associated with PHP DOM, let’s create a class which is able to add and remove books in a library and query the catalog. It should offer the following functionality:

  • Query for a book found by its ISBN
  • Add a book to the library
  • Remove a book from the library
  • Find all books of a specific genre

The DTD and XML

In this article, I’ll use the following DTD and XML that describe a library and its books. This should provide enough material to demonstrate how the extension can be used:

<!ELEMENT library (book*)> 
<!ELEMENT book (title, author, genre, chapter*)> 
  <!ATTLIST book isbn ID #REQUIRED> 
<!ELEMENT title (#PCDATA)> 
<!ELEMENT author (#PCDATA)> 
<!ELEMENT genre (#PCDATA)> 
<!ELEMENT chapter (chaptitle,text)> 
  <!ATTLIST chapter position NMTOKEN #REQUIRED> 
<!ELEMENT chaptitle (#PCDATA)> 
<!ELEMENT text (#PCDATA)>

<?xml version="1.0" encoding="utf-8"?> 
<!DOCTYPE library SYSTEM "library.dtd"> 
<library> 
  <book isbn="isbn1234"> 
    <title>A Book</title> 
    <author>An Author</author> 
    <genre>Horror</genre> 
    <chapter position="first"> 
      <chaptitle>chapter one</chaptitle> 
      <text><![CDATA[Lorem Ipsum...]]></text> 
    </chapter> 
  </book> 
  <book isbn="isbn1235"> 
    <title>Another Book</title> 
    <author>Another Author</author> 
    <genre>Science Fiction</genre> 
    <chapter position="first"> 
      <chaptitle>chapter one</chaptitle> 
      <text><![CDATA[<i>Sit Dolor Amet...</i>]]></text> 
    </chapter> 
  </book> 
</library>

One of the most important things in understanding DOM is the concept of a node. A node is essentially any conceptual item in the XML document. If it’s an element (such as chapter) then it’s a node. If it’s an attribute (such as isbn), then DOM views it as a node too. Nodes provide the atomic structure of an XML document.

PHP DOM subclasses DOMNode to provide child classes which represent different aspects of the document. So, DOMDocument actually inherits from DOMNode. DOMElement and DOMAttr also inherit from DOMNode. Having a common parent class means that common methods and properties are available to all nodes, such as those used to determine a node’s type or value, or to append children to it.
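To make the shared parent class concrete, here is a minimal sketch that inspects a few nodes from a cut-down version of the library XML. The inline XML string and the variable names are assumptions made for illustration; only the DOMNode properties nodeType and nodeName are being demonstrated.

```php
<?php
// Illustrative sketch: a tiny, whitespace-free document so that
// firstChild always lands on the node we expect.
$doc = new DOMDocument();
$doc->loadXML('<library><book isbn="isbn1234"><title>A Book</title></book></library>');

$book = $doc->documentElement->firstChild;   // the <book> DOMElement
$isbn = $book->getAttributeNode('isbn');     // the isbn DOMAttr
$text = $book->firstChild->firstChild;       // the DOMText inside <title>

// Every one of these objects is a DOMNode, so nodeType and nodeName
// are available on all of them.
foreach ([$doc, $book, $isbn, $text] as $node) {
    printf("%-12s type=%d name=%s\n", get_class($node), $node->nodeType, $node->nodeName);
}
```

The numeric nodeType values correspond to the XML_*_NODE constants that the extension defines, such as XML_ELEMENT_NODE and XML_ATTRIBUTE_NODE.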

The Library Class

A class called Library offers methods for the required functionality that was outlined in the introduction. It also has a constructor and destructor, and internal properties to store the DOM Document and the path to the XML file. The various operations are performed on the DOM Document reference, and the path is used when saving the tree back to the file system as XML.

<?php 
class Library
{
    private $xmlPath;
    private $domDocument;

    public function __construct($xmlPath) {
        // TODO: instantiate the private variable representing
        // the DOMDocument
    }

    public function __destruct() {
        // TODO: free memory associated with the DOMDocument
    }

    public function getBookByISBN($isbn) {
        // TODO: return an array with properties of a book 
    }

    public function addBook($isbn, $title, $author, $genre, $chapters) {
        // TODO: add a book to the library 
    }

    public function deleteBook($isbn) {
        // TODO: Delete a book from the library
    }

    public function findBooksByGenre($genre) {
        // TODO: Return an array of books
    }
}

I’ll deliberately keep things simple as the example only serves to demonstrate what DOM can do. In a real-world application, perhaps you’d instantiate book objects to encapsulate the problem more fully, and you’d probably want to handle errors more gracefully as well. You don’t need to do this at this stage, though. We can just assume that values passed and returned are strings or arrays, and errors can be handled by throwing a generic exception.

Handling Object Construction and Destruction

The constructor is designed to take the path to the XML document that you want to use as an argument. It performs a few tests to ensure that the document is valid.

The first test determines whether the document being loaded uses the “library” doctype. Each DOMDocument has the public property doctype, which holds a DOMDocumentType object describing the doctype used by the XML document. So for this example, you should see that the name property of doctype is set to “library” when you’ve loaded up the document.

The second test ensures that the DTD is referenced in the expected manner, using the doctype’s public systemId or publicId properties. The XML used here is defined by a DTD specified on the system as library.dtd, so the constructor compares the systemId property against that value.

The third test ensures that the document itself is valid according to the DTD. Well-formedness (i.e. no tag mismatches, etc.) is checked when the document is parsed by load(); validate() then checks that the document adheres to the DTD on which it is based.

Once all of these conditions are met, it stores a reference to the loaded document and the path to the XML file as internal properties to be used later by other methods. But if at any point one of the tests fails, an exception is thrown.

<?php
public function __construct($xmlPath) {
    // load the document; load() returns false if it cannot be parsed
    $doc = new DOMDocument();
    if (!$doc->load($xmlPath)) {
        throw new Exception("Unable to load " . $xmlPath);
    }

    // is this a library xml file? (doctype is null when the
    // document has no DOCTYPE declaration)
    if ($doc->doctype === null ||
        $doc->doctype->name != "library" ||
        $doc->doctype->systemId != "library.dtd") {
        throw new Exception("Incorrect document type");
    }

    // does the document adhere to its DTD?
    if ($doc->validate()) {
        $this->domDocument = $doc;
        $this->xmlPath = $xmlPath;
    }
    else {
        throw new Exception("Document did not validate");
    }
}

The destructor method releases any memory used by the $domDocument. This is really just a simple call to unset the property.

<?php
public function __destruct() { 
    unset($this->domDocument); 
}

Return a Book by its ISBN

Now on to the main methods for reading and manipulating the underlying XML document.

The first method obtains details of a book from a provided ISBN. You can provide the ISBN as a string and the method returns an array detailing the properties of the book.

PHP DOM provides a very simple function to return a specific element based on its ID: getElementById(), which returns a DOMElement object. For this to work, you must have nominated an ID attribute in your DTD, as I did:

<!ATTLIST book isbn ID #REQUIRED>

It’s important to know that getElementById() only works if the document has been validated against a DTD. If not, then the function will simply not pick up the fact that the element has an ID.
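The following sketch shows the lookup in isolation. To keep it self-contained it assumes an inline DOCTYPE with a trimmed-down DTD instead of the external library.dtd used in this article, and the ISBN values are made up for the example.

```php
<?php
// Illustrative sketch: the DTD is embedded in the document (an assumption
// made so the example runs without a separate library.dtd file).
$xml = <<<'XML'
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE library [
  <!ELEMENT library (book*)>
  <!ELEMENT book (title)>
  <!ATTLIST book isbn ID #REQUIRED>
  <!ELEMENT title (#PCDATA)>
]>
<library><book isbn="isbn1234"><title>A Book</title></book></library>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);

// Validate against the DTD so that isbn is treated as an ID attribute.
if (!$doc->validate()) {
    throw new RuntimeException('Document did not validate');
}

$book    = $doc->getElementById('isbn1234'); // a DOMElement
$missing = $doc->getElementById('isbn9999'); // null: no such ID

echo $book->getElementsByTagName('title')->item(0)->nodeValue, "\n";
```

Note that a failed lookup returns null rather than throwing, so the caller has to check the result before using it.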

Another way of obtaining elements from a document is to use getElementsByTagName(). This method returns a collection of nodes which have been found with the specified tag name. The collection returned is a DOMNodeList, which is traversable.

Items in the DOMNodeList can also be picked out by their position in the list with item(). Because the DTD states that a book has exactly one author, we know that the DOMNodeList will contain one node which can be accessed with item(0). The DTD enforces this fact, and if it were different in the document then you would have received a validation error when the Library object was created.

Once you have found the particular node you want, you can find its value using the public property nodeValue.

To access attributes, you can make use of DOMNode’s public property attributes, which returns a DOMNamedNodeMap. This is similar to the DOMNodeList in that it is traversable, but you can also pick out a specific attribute using the getNamedItem() method and just pass the name of the attribute as a string. The return value is a DOMNode.
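As a small sketch of these accessors working together, the snippet below reads an author, a list of chapter positions, and an ISBN from a single book element. The inline XML (including the second chapter) is an assumption made for illustration and is not validated against the DTD.

```php
<?php
// Illustrative sketch: one <book> with two chapters, built from a string.
$doc = new DOMDocument();
$doc->loadXML(
    '<book isbn="isbn1234"><author>An Author</author>'
    . '<chapter position="first"><chaptitle>chapter one</chaptitle></chapter>'
    . '<chapter position="second"><chaptitle>chapter two</chaptitle></chapter></book>'
);
$book = $doc->documentElement;

// item(0) picks a node out of the DOMNodeList by position.
$author = $book->getElementsByTagName('author')->item(0)->nodeValue;

// A DOMNodeList is traversable, so foreach works on it directly.
$positions = [];
foreach ($book->getElementsByTagName('chapter') as $chapter) {
    // attributes is a DOMNamedNodeMap; getNamedItem() returns a DOMNode.
    $positions[] = $chapter->attributes->getNamedItem('position')->nodeValue;
}

$isbn = $book->attributes->getNamedItem('isbn')->nodeValue;

echo $author, ' / ', implode(', ', $positions), ' / ', $isbn, "\n";
```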

The implementation of the method to retrieve a book and its information thus looks like this:

<?php
public function getBookByISBN($isbn) 
{ 
    // get a book element from the isbn ID 
    $book = $this->domDocument->getElementById($isbn); 

    // if a book was not returned...
    if (!$book) {
        throw new Exception("No book found with ISBN ". $isbn); 
    }

    $arrBook = array();
    $arrBook["isbn"] = $isbn; 

    // get the data from the elements based on their tag names 
    //
    // we know these DOMNodeLists will only return one 
    // item since the DTD states this
    $arrBook["author"] = $book->getElementsByTagName("author")
        ->item(0)->nodeValue;
    $arrBook["title"]  = $book->getElementsByTagName("title")
        ->item(0)->nodeValue; 
    $arrBook["genre"]  = $book->getElementsByTagName("genre")
        ->item(0)->nodeValue; 

    $chapters = $book->getElementsByTagName("chapter"); 

    $arrChapters = array(); 

    // iterate over the chapter elements 
    foreach($chapters as $chapter) { 
        $chapterId = $chapter->attributes
            ->getNamedItem("position")->nodeValue; 
        $chapterTitle = $chapter
            ->getElementsByTagName("chaptitle")->item(0)
            ->nodeValue; 
        $chapterText = $chapter
            ->getElementsByTagName("text")->item(0)
            ->nodeValue; 

        $arrChapter["title"] = $chapterTitle; 
        $arrChapter["text"] = $chapterText; 

        $arrChapters[$chapterId] = $arrChapter; 
    } 

    $arrBook["chapters"] = $arrChapters; 

    return $arrBook; 
}

Identifying and pulling data from an XML document is relatively simple. The main hurdle to overcome is understanding the node concept; once you understand that, you’ll find that obtaining the data you want is a straightforward process.

Adding a Book to the Library

The next method to define adds a book to the XML database. The method takes the properties and an array of chapters of the book to add.

One way of performing such a task is to use the createElement() method, add the new node to the document, and keep a reference to it so you can operate on the object from that point forward. When you create an element you must also add it to the document. Using createElement() does not automatically add it to the document for you. It associates the element with the document, but that’s as far as it goes. It’s good practice to add elements you intend to be part of the document as soon as they are instantiated so that they are not forgotten!

You can use the documentElement property to identify the root element of the XML document. If we didn’t do this and instead appended directly to the document, we would in fact be adding a child to the very end of the document (i.e. outside of the library element). This would result in a validation error. If you think about it, this behaviour of DOM is totally reasonable; treating the document as the root element and adding a child to it would place it after the library element as that is the first child of the document.
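A short sketch of the point about createElement(): the new element belongs to the document but has no parent until it is appended. The empty library document here is an assumption for illustration.

```php
<?php
// Illustrative sketch: start from an empty <library/> root.
$doc = new DOMDocument();
$doc->loadXML('<library/>');

// createElement() ties the element to $doc but does not place it in the tree.
$newbook = $doc->createElement('book');
$attachedBefore = $newbook->parentNode !== null;   // false at this point

// Append to the root element, not to the document itself.
$doc->documentElement->appendChild($newbook);
$parentName = $newbook->parentNode->nodeName;      // "library"

echo $doc->saveXML($doc->documentElement), "\n";
```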

Of course, the book element must contain an ISBN, so an attribute must be added to the newly created element. There are two ways of doing this. The simplest is to use setAttribute() which takes the name of the attribute and the value of the attribute as arguments. The second way is to create a DOMAttr object and then append that to the element. DOMAttr is a subclass of DOMNode, so it benefits from all the inherited methods and properties its parent offers.

setAttribute() and setAttributeNode() are responsible for adding and updating attributes associated with an element. If the attribute does not exist, it will be created. If it does exist, it will be updated.
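The create-or-update behaviour can be sketched as follows. The ISBN values are invented for the example, and the book element is created in an otherwise empty document.

```php
<?php
// Illustrative sketch: a lone <book> element as the document root.
$doc  = new DOMDocument();
$book = $doc->appendChild($doc->createElement('book'));

// First call creates the attribute, second call updates it.
$book->setAttribute('isbn', 'isbn1234');
$book->setAttribute('isbn', 'isbn9999');
$afterSetAttribute = $book->getAttribute('isbn');

// setAttributeNode() with an attribute of the same name replaces the old one.
$book->setAttributeNode(new DOMAttr('isbn', 'isbn0001'));
$afterSetAttributeNode = $book->getAttribute('isbn');

echo $afterSetAttribute, ' -> ', $afterSetAttributeNode, "\n";
```

In both cases the element ends up with a single isbn attribute rather than accumulating duplicates.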

To supply the value for a text element, it is advisable to use a DOMCdataSection. The chapters of the books are given as PCDATA and not CDATA in the DTD. This is because an element cannot be declared as containing CDATA directly; we have to declare it as PCDATA and then wrap the content in <![CDATA[...]]>. It sounds counter-intuitive, as we need to be able to put unparsed character data in the text element for use later, but this is why we have to create a specific DOMCdataSection; this will safely wrap our text in <![CDATA[...]]>. If you were to add HTML to a node as ordinary text, you’d find that special characters such as < or & are converted to their relevant entities (i.e. &lt; and &amp;). This is because these characters have special meaning in XML: the ampersand starts an entity reference, and the less-than symbol starts a tag. DOM substitutes these so as not to cause any parsing issues when the document is loaded or validated.
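To see the difference, this sketch stores the same HTML fragment once in a DOMCdataSection and once in a plain text node, then serializes both. The chapter wrapper element and the sample string are assumptions for illustration.

```php
<?php
// Illustrative sketch: compare CDATA and plain text serialization.
$doc  = new DOMDocument();
$root = $doc->appendChild($doc->createElement('chapter'));

$html = '<i>Sit Dolor Amet</i> & more';

// Variant 1: wrap the content in a CDATA section.
$asCdata = $root->appendChild($doc->createElement('text'));
$asCdata->appendChild(new DOMCdataSection($html));

// Variant 2: add the same content as an ordinary text node.
$asText = $root->appendChild($doc->createElement('text'));
$asText->appendChild($doc->createTextNode($html));

$cdataXml = $doc->saveXML($asCdata); // markup kept verbatim inside <![CDATA[ ]]>
$textXml  = $doc->saveXML($asText);  // < and & written out as entities

echo $cdataXml, "\n", $textXml, "\n";
```

Either way, reading nodeValue back gives the original string; the two variants differ only in how the content is written to the file.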

The last step in adding a book is to save the new document back into the file, which is done with the document’s save() method.

The method altogether looks like this:

<?php
public function addBook($isbn, $title, $author, $genre, $chapters) 
{ 
    // create a new element representing the new book 
    $newbook = $this->domDocument->createElement("book"); 
    // append the newly created element
    $this->domDocument->documentElement
        ->appendChild($newbook);

    // setting the attribute can be done in one of two ways 
    // Method One: 
    // $newbook->setAttribute("isbn", $isbn); 

    // Method Two: 
    $idAttribute = new DOMAttr("isbn", $isbn); 
    $newbook->setAttributeNode($idAttribute); 

    $title = $this->domDocument
        ->createElement("title", $title); 
    $newbook->appendChild($title); 

    $author = $this->domDocument
        ->createElement("author", $author); 
    $newbook->appendChild($author); 

    $genre = $this->domDocument
        ->createElement("genre", $genre); 
    $newbook->appendChild($genre); 

    foreach($chapters as $position => $chapter) {
        $newchapter = $this->domDocument
            ->createElement("chapter"); 
        $newbook->appendChild($newchapter); 

        $newchapter->setAttribute("position", $position);

        $newchaptitle = $this->domDocument
            ->createElement("chaptitle", $chapter["title"]);
        $newchapter->appendChild($newchaptitle); 

        $newtext = $this->domDocument->createElement("text"); 
        $newchapter->appendChild($newtext); 

        // Rather than creating a new element, create a
        // DOMCdataSection which ensures our text is
        // wrapped in <![CDATA[ and ]]>
        $cdata = new DOMCdataSection($chapter["text"]); 
        $newtext->appendChild($cdata); 
    }

    // save the document 
    $this->domDocument->save($this->xmlPath); 
}

Deleting a Book from the Library

The next method to tackle is deleting a book. This is just a case of identifying which element in the XML document you want to delete and then using the removeChild() method to remove it. There are two important things to understand, however.

First, you are unable to remove a child from an instance of DOMDocument directly. You have to access the documentElement and remove the child from there. This is for the same reason that you had to refer to documentElement when adding a book to the library.

Second, removing the element from the document just removes it from memory. If you want to persist the data, you should save it back to a file.
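Both points can be sketched with a throwaway in-memory document. The two placeholder books are assumptions for illustration, and the save() call is left commented out so the example does not write any file.

```php
<?php
// Illustrative sketch: two placeholder books in a small library.
$doc = new DOMDocument();
$doc->loadXML('<library><book isbn="a"/><book isbn="b"/></library>');

// Remove the first book via the root element.
$first   = $doc->documentElement->firstChild;
$removed = $doc->documentElement->removeChild($first);

// The in-memory tree now holds one book; nothing on disk has changed.
$remaining = $doc->getElementsByTagName('book')->length;

echo $doc->saveXML($doc->documentElement), "\n";

// To persist the change you would write the tree back out, e.g.:
// $doc->save('library.xml');
```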

Here’s what the deleteBook() method looks like:

<?php
public function deleteBook($isbn) { 
    // get the book element based on its ID 
    $book = $this->domDocument->getElementById($isbn); 

    // getElementById() returns null when nothing matches
    if (!$book) {
        throw new Exception("No book found with ISBN " . $isbn);
    }

    // simply remove the child from the document's
    // documentElement 
    $this->domDocument->documentElement->removeChild($book);

    // save back to disk 
    $this->domDocument->save($this->xmlPath);
}

Find Books by Genre

The method to find specific books based on a genre employs XPath to obtain the results we need. getElementById(), as you saw before, is a convenient way of picking items out of the DOM when we have declared an ID within a DTD. But what can we do if we need to query against some other data in the XML? We can use a DOMXPath object. XPath itself is beyond the scope of this article, but I do advise you to look at some resources explaining the syntax. The XPath query to find any book item in the XML which has a genre of a specific type is:

//library/book/genre[text() = "some genre"]/..

This query says that we first want to access a genre element in the path //library/book. The two leading slashes match library elements anywhere in the document (here, that is just the root element), and the single slashes indicate that book is a child of library and genre is a child of book. [text() = "some genre"] indicates that we are looking for a genre element where the text inside it is “some genre”. On its own, the result would just be the genre element, which is why /.. is tacked on at the end to indicate that we actually need genre’s parent.

XPath is a great way to locate nodes in a structure. If you find yourself iterating over a few DOMNodeLists and testing nodeValues for certain values then you’d probably be better off looking at an equivalent XPath query, which will usually be shorter and easier to read.
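Here is a minimal sketch of the query run against a cut-down copy of the library XML (chapters omitted for brevity, which is an assumption of the example). It also runs an equivalent form that puts the predicate on book itself, so no step back up to the parent is needed.

```php
<?php
// Illustrative sketch: two books, built from an inline string.
$doc = new DOMDocument();
$doc->loadXML(
    '<library>'
    . '<book isbn="isbn1234"><title>A Book</title><genre>Horror</genre></book>'
    . '<book isbn="isbn1235"><title>Another Book</title><genre>Science Fiction</genre></book>'
    . '</library>'
);
$xpath = new DOMXPath($doc);

// Select the genre element, then step up to its parent book.
$viaParent = $xpath->query('//library/book/genre[text() = "Horror"]/..');

// Equivalent: select the book directly, filtered by its genre child.
$viaPredicate = $xpath->query('//library/book[genre/text() = "Horror"]');

$titles = [];
foreach ($viaParent as $book) {
    $titles[] = $book->getElementsByTagName('title')->item(0)->nodeValue;
}

echo implode(', ', $titles), "\n";
```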

Here’s what the search method looks like:

<?php
public function findBooksByGenre($genre) 
{ 
    // use XPath to find the books we're looking for; $genre is
    // inserted as-is, so it must not contain a double quote
    $query = '//library/book/genre[text() = "' . $genre . '"]/..';

    // create a new XPath object and associate it with the document we want to query against 
    $xpath = new DOMXPath($this->domDocument); 
    $result = $xpath->query($query); 

    $arrBooks = array(); 

    // iterate over the results 
    foreach($result as $book)  {
        // add the title of the book to an array 
        $arrBooks[] = $book->getElementsByTagName("title")->item(0)->nodeValue; 
    } 

    return $arrBooks; 
}

Summary

This article was just a taster to show you how you can use DOM to manipulate and report back from XML data. PHP DOM is not as scary as it looks, and you may find that you prefer it over SimpleXML in certain circumstances.

One of the most important things you learned was the concept of the node, the basic building block of an XML document as far as DOM is concerned. You saw how to load an XML document into memory and validate it, pull data from it using getElementById() and getElementsByTagName(), add and remove elements, and work with attributes, and you looked at DOMNodeList and DOMNamedNodeMap as ways to work with collections of nodes.

While a lot of things you saw today are things that you can probably do easily in SimpleXML already, I hope this article showed you how the same things can be achieved with DOM and what some of the benefits of DOM are.

  • javadecaf

    Shouldn’t this:

    public function findBooksByGenre($genre)
    {
        // use XPath to find the book we're looking for
        $query = '//library/book/genre1/..';
    

    actually be this?

    public function findBooksByGenre($genre)
    {
        // use XPath to find the book we're looking for
        $query = '//library/book/'.$genre.'/..';
    

    I know it’s just an example, but I thought you might want to edit it for clarity (unless I’m missing something). Thanks for the article!

  • static07

    Alternatively you could select books of “some genre” by:
    //library/book[genre/text()="some genre"]
    Personally I find this query more elegant.