PHP DOM: Working with XML

SimpleXML allows you to quickly and easily work with XML documents, and in the majority of cases SimpleXML is sufficient. But if you’re working with XML in any serious capacity, you’ll eventually need a feature that isn’t supported by SimpleXML, and that’s where the PHP DOM (Document Object Model) comes in.

PHP DOM is an implementation of the W3C DOM standard, and it adheres more closely to the object model than SimpleXML does. It may seem a little overwhelming at first, but if you’re willing to learn, you’ll find that this library for accessing and manipulating XML documents provides a great deal of control over working with XML documents in PHP. This is because DOM differentiates between the various constituents of an XML document, such as different node types.

To explore some of the basic functionality associated with PHP DOM, let’s create a class that can add and remove books in a library and query the catalog. It should offer the following functionality:

Query for a book found by its ISBN
Add a book to the library
Remove a book from the library
Find all books of a specific genre

Key Takeaways

PHP DOM provides a robust way to manipulate XML documents in PHP, adhering closely to the W3C DOM standard, unlike SimpleXML.
Understanding nodes is crucial in PHP DOM; nodes represent elements, attributes, and other parts of the XML document, providing the atomic structure necessary for manipulation.
The Library class demonstrates practical PHP DOM usage, including methods to add, delete, and find books within a library using XML.
PHP DOM allows for detailed manipulation of XML elements and attributes, including creating new elements, setting attributes, and managing data with methods like `createElement()` and `setAttribute()`.
Deleting and adding elements in PHP DOM involves understanding the document structure and correctly accessing and modifying the document tree.
Querying XML data based on specific criteria can be efficiently handled using XPath within PHP DOM, allowing for complex queries like finding books by genre.

The DTD and XML

In this article, I’ll use the following DTD and XML that describe a library and its books. This should provide enough material to demonstrate how the extension can be used:

<!ELEMENT library (book*)>
<!ELEMENT book (title, author, genre, chapter*)>
  <!ATTLIST book isbn ID #REQUIRED>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT genre (#PCDATA)>
<!ELEMENT chapter (chaptitle,text)>
  <!ATTLIST chapter position NMTOKEN #REQUIRED>
<!ELEMENT chaptitle (#PCDATA)>
<!ELEMENT text (#PCDATA)>

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE library SYSTEM "library.dtd">
<library>
  <book isbn="isbn1234">
    <title>A Book</title>
    <author>An Author</author>
    <genre>Horror</genre>
    <chapter position="first">
      <chaptitle>chapter one</chaptitle>
      <text><![CDATA[Lorem Ipsum...]]></text>
    </chapter>
  </book>
  <book isbn="isbn1235">
    <title>Another Book</title>
    <author>Another Author</author>
    <genre>Science Fiction</genre>
    <chapter position="first">
      <chaptitle>chapter one</chaptitle>
      <text><![CDATA[<i>Sit Dolor Amet...</i>]]></text>
    </chapter>
  </book>
</library>

One of the most important things in understanding DOM is the concept of a node. A node is essentially any conceptual item in the XML document. If it’s an element (such as chapter), then it’s a node. If it’s an attribute (such as isbn), then it’s also viewed as a node by DOM. Nodes provide the atomic structure of an XML document.

PHP DOM subclasses DOMNode to provide child classes which represent different aspects of the document. So, DOMDocument actually inherits from DOMNode. DOMElement and DOMAttr also inherit from DOMNode. Having a common parent class enables you to have common methods and properties available to all nodes, such as those used to determine a node’s type, value, or even adding to it.
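To make the shared parent class concrete, here is a minimal sketch. It uses a small inline document as a stand-in for the library file (an assumption made only to keep the example self-contained) and checks that the document, an element, and an attribute are all DOMNode instances with the same basic properties.

```php
<?php
// Illustrative only: a tiny inline document stands in for the library file.
$doc = new DOMDocument();
$doc->loadXML('<library><book isbn="isbn1234"><title>A Book</title></book></library>');

$book = $doc->documentElement->firstChild;   // the <book> element (DOMElement)
$isbn = $book->getAttributeNode('isbn');     // the isbn attribute (DOMAttr)

// All three objects inherit from DOMNode...
$isNode = [
    $doc instanceof DOMNode,
    $book instanceof DOMNode,
    $isbn instanceof DOMNode,
];

// ...so they all expose nodeType, nodeName and nodeValue.
$types = [
    $doc->nodeType === XML_DOCUMENT_NODE,
    $book->nodeType === XML_ELEMENT_NODE,
    $isbn->nodeType === XML_ATTRIBUTE_NODE,
];

echo $book->nodeName, ' / ', $isbn->nodeName, ' = ', $isbn->nodeValue, PHP_EOL;
// prints "book / isbn = isbn1234"
```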

The Library Class

A class called Library offers methods for the required functionality that was outlined in the introduction. It also has a constructor and destructor, and internal properties to store the DOM Document and the path to the XML file. The various operations are performed on the DOM Document reference, and the path is used when saving the tree back to the file system as XML.

<?php
class Library
{
    private $xmlPath;
    private $domDocument;

    public function __construct($xmlPath) {
        // TODO: instantiate the private variable representing
        // the DOMDocument
    }

    public function __destruct() {
        // TODO: free memory associated with the DOMDocument
    }

    public function getBookByISBN($isbn) {
        // TODO: return an array with properties of a book
    }

    public function addBook($isbn, $title, $author, $genre, $chapters) {
        // TODO: add a book to the library
    }

    public function deleteBook($isbn) {
        // TODO: Delete a book from the library
    }

    public function findBooksByGenre($genre) {
        // TODO: Return an array of books
    }
}

I’ll deliberately keep things simple as the example only serves to demonstrate what DOM can do. In a real-world application, perhaps you’d instantiate book objects to encapsulate the problem more fully, and you’d probably want to handle errors more gracefully as well. You don’t need to do this at this stage, though. We can just assume that values passed and returned are strings or arrays, and errors can be handled by throwing a generic exception.

Handling Object Construction and Destruction

The constructor takes the path to the XML document that you want to use as an argument. It performs a few tests to ensure that the document is valid.

The first test is to determine whether the document being loaded uses the “library” doctype. Each DOMDocument has the public property doctype, which returns an object describing the doctype used by the XML document. So for this example, you should see that the name property of doctype is set to “library” once you’ve loaded the document.

The second test is to ensure that the definition used is defined in the correct manner using the public systemId or publicId properties. The XML used here is defined by a DTD specified on the system as library.dtd, so it tests for that by comparing it against the systemId property.

The third test is to ensure that the document itself is valid according to the DTD. By this point the document must also be well-formed (e.g. no mismatched tags), so passing validation means it is both well-formed and adheres to the DTD on which it is based.

Once all of these conditions are met, the constructor stores a reference to the loaded document and the path to the XML file as internal properties to be used later by other methods. But if at any point one of the tests fails, an exception is thrown.

<?php
public function __construct($xmlPath) {
    // load the document
    $doc = new DOMDocument();
    if (!$doc->load($xmlPath)) {
        throw new Exception("Unable to load " . $xmlPath);
    }
    // is this a library xml file?
    // (doctype is null when the document declares no DOCTYPE)
    if ($doc->doctype === null ||
        $doc->doctype->name != "library" ||
        $doc->doctype->systemId != "library.dtd") {
        throw new Exception("Incorrect document type");
    }
    // is the document valid according to the DTD?
    if ($doc->validate()) {
        $this->domDocument = $doc;
        $this->xmlPath = $xmlPath;
    }
    else {
        throw new Exception("Document did not validate");
    }
}

The destructor method releases any memory used by the $domDocument. This is really just a simple call to unset the property.

<?php

public function __destruct() {

    unset($this->domDocument);

}

Return a Book by its ISBN

Now on to the main methods for reading and manipulating the underlying XML document.

The first method obtains details of a book from a provided ISBN. You can provide the ISBN as a string and the method returns an array detailing the properties of the book.

PHP DOM provides a very simple method to return a specific element based on its ID: getElementById(), which returns a DOMElement object. For this to work, you must have declared an ID attribute in your DTD, as I did:

<!ATTLIST book isbn ID #REQUIRED>

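It’s important to know that, when the ID is declared in a DTD, getElementById() only works if the document has been validated against that DTD (with validate() or the validateOnParse property). If not, the method will simply not pick up the fact that the element has an ID. The alternative is to mark the attribute as an ID yourself with DOMElement::setIdAttribute().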
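The lookup can be tried in isolation with a short sketch. It assumes a cut-down DTD embedded in the document itself (not the full library.dtd) so that the example needs no external files, and it validates while loading via validateOnParse.

```php
<?php
// Assumption: a reduced, inline DTD is used here instead of library.dtd.
$xml = <<<'XML'
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE library [
  <!ELEMENT library (book*)>
  <!ELEMENT book (title)>
  <!ATTLIST book isbn ID #REQUIRED>
  <!ELEMENT title (#PCDATA)>
]>
<library>
  <book isbn="isbn1234"><title>A Book</title></book>
</library>
XML;

$doc = new DOMDocument();
$doc->validateOnParse = true;   // validate against the DTD while loading
$doc->loadXML($xml);

// The isbn attribute is declared as an ID, so it can be looked up directly.
$book = $doc->getElementById('isbn1234');
$title = $book->getElementsByTagName('title')->item(0)->nodeValue;

// An ID that does not exist in the document yields null.
$missing = $doc->getElementById('isbn9999');
```

Checking the return value for null before using it, as the Library methods do, avoids calling methods on a missing element.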

Another way of obtaining elements from a document is to use getElementsByTagName(). This method returns a collection of nodes which have been found with the specified tag name. The collection returned is a DOMNodeList, which is traversable.

Items in the DOMNodeList can also be picked out by their position in the list with item(). Because the DTD specifies that a book has exactly one author, we know that the DOMNodeList will contain one node, which can be accessed with item(0). The DTD enforces this fact, and if the document were different then you would have received a validation error when the Library object was created.
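As a rough illustration of working with a DOMNodeList, the sketch below loads a small inline document (a stand-in for the library file) and reads the list in three ways: by length, by position, and by iteration.

```php
<?php
$doc = new DOMDocument();
$doc->loadXML(
    '<library>'
    . '<book isbn="isbn1234"><title>A Book</title><author>An Author</author></book>'
    . '<book isbn="isbn1235"><title>Another Book</title><author>Another Author</author></book>'
    . '</library>'
);

// All <title> elements anywhere below the document, in document order.
$titles = $doc->getElementsByTagName('title');

$count = $titles->length;              // number of nodes in the list
$first = $titles->item(0)->nodeValue;  // pick a node by position

// A DOMNodeList is traversable, so foreach works too.
$all = [];
foreach ($titles as $title) {
    $all[] = $title->nodeValue;
}

// item() returns null for a position outside the list.
$outOfRange = $titles->item(5);
```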

Once you have found the particular node you want, you can find its value using the public property nodeValue.

To access attributes, you can make use of DOMNode’s public property attributes, which returns a DOMNamedNodeMap. This is similar to the DOMNodeList in that it is traversable, but you can also pick out a specific attribute using the getNamedItem() method and just pass the name of the attribute as a string. The return value is a DOMNode.
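Here is a small sketch of attribute access on a single chapter element. The lang attribute is hypothetical and not part of the article’s DTD; it is added only so that there is more than one attribute to iterate over.

```php
<?php
$doc = new DOMDocument();
// "lang" is a made-up attribute used purely for illustration.
$doc->loadXML('<chapter position="first" lang="en"><chaptitle>chapter one</chaptitle></chapter>');
$chapter = $doc->documentElement;

// Pick out one attribute through the DOMNamedNodeMap...
$position = $chapter->attributes->getNamedItem('position')->nodeValue;

// ...or use the DOMElement shortcut, which returns the value directly.
$samePosition = $chapter->getAttribute('position');

// The map is traversable: collect every attribute as name => value.
$pairs = [];
foreach ($chapter->attributes as $attr) {
    $pairs[$attr->nodeName] = $attr->nodeValue;
}
```

getAttribute() is shorter when only the value is needed; the map is more useful when the attribute names are not known in advance.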

The implementation of the method to retrieve a book and its information thus looks like this:

<?php
public function getBookByISBN($isbn)
{
    // get a book element from the isbn ID
    $book = $this->domDocument->getElementById($isbn);

    // if a book was not returned...
    if (!$book) {
        throw new Exception("No book found with ISBN ". $isbn);
    }

    $arrBook = array();
    $arrBook["isbn"] = $isbn;

    // get the data from the elements based on their tag names
    //
    // we know these DOMNodeLists will only return one
    // item since the DTD states this
    $arrBook["author"] = $book->getElementsByTagName("author")
        ->item(0)->nodeValue;
    $arrBook["title"]  = $book->getElementsByTagName("title")
        ->item(0)->nodeValue;
    $arrBook["genre"]  = $book->getElementsByTagName("genre")
        ->item(0)->nodeValue;

    $chapters = $book->getElementsByTagName("chapter");
    $arrChapters = array();

    // iterate over the chapter elements
    foreach($chapters as $chapter) {
        $chapterId = $chapter->attributes
            ->getNamedItem("position")->nodeValue;
        $chapterTitle = $chapter
            ->getElementsByTagName("chaptitle")->item(0)
            ->nodeValue;
        $chapterText = $chapter
            ->getElementsByTagName("text")->item(0)
            ->nodeValue;

        $arrChapter = array();
        $arrChapter["title"] = $chapterTitle;
        $arrChapter["text"] = $chapterText;
        $arrChapters[$chapterId] = $arrChapter;
    }

    $arrBook["chapters"] = $arrChapters;
    return $arrBook;
}

Identifying and pulling data from an XML document is relatively simple. The main hurdle to overcome is understanding the node concept; once you understand that, you’ll find that obtaining the data you want is a straightforward process.

Adding a Book to the Library

The next method to define adds a book to the XML database. The method takes the properties and an array of chapters of the book to add.

One way of performing such a task is to use the createElement() method, add the new node to the document, and keep a reference to it so you can operate on the object from that point forward. When you create an element you must also add it to the document; createElement() does not do this for you. It associates the element with the document, but that’s as far as it goes. It’s good practice to add elements you intend to be part of the document as soon as they are instantiated so that they are not forgotten!

You can use the documentElement property to identify the root element of the XML document. If we weren’t to do this and just add directly to the document, we would in fact be adding a child to the very end of the document (i.e. outside of the library element). This would result in a validation error. If you think about it, this behaviour of DOM is totally reasonable; treating the document as the root element and adding a child to it would place it after the library element as that is the first child of the document.
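The difference between the document node and its root element can be seen in a short sketch. It loads an inline string that references library.dtd without actually reading that file (by default the external DTD is not fetched), which is an assumption made to keep the example self-contained.

```php
<?php
$doc = new DOMDocument();
$doc->loadXML(
    '<!DOCTYPE library SYSTEM "library.dtd">'
    . '<library><book isbn="isbn1234"/></library>'
);

// The document node's first child is the DOCTYPE, not the root element.
$firstChildIsDoctype = $doc->firstChild instanceof DOMDocumentType;
$doctypeName = $doc->doctype->name;         // "library"
$systemId = $doc->doctype->systemId;        // "library.dtd"

// documentElement always points at the root element itself.
$root = $doc->documentElement;
$rootName = $root->nodeName;                // "library"

// Appending to the root element places the new book inside <library>.
$root->appendChild($doc->createElement('book'));
$bookCount = $root->getElementsByTagName('book')->length;
```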

Of course, the book element must contain an ISBN, so an attribute must be added to the newly created element. There are two ways of doing this. The simplest is to use setAttribute() which takes the name of the attribute and the value of the attribute as arguments. The second way is to create a DOMAttr object and then append that to the element. DOMAttr is a subclass of DOMNode, so it benefits from all the inherited methods and properties its parent offers.

setAttribute() and setAttributeNode() are responsible for adding and updating attributes associated with an element. If the attribute does not exist, it will be created. If it does exist, it will be updated.
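The create-or-update behaviour can be sketched as follows. The ISBN values are placeholders, and the attribute node is built with createAttribute() here rather than new DOMAttr(); either route ends in the same setAttributeNode() call.

```php
<?php
$doc = new DOMDocument();
$book = $doc->createElement('book');
$doc->appendChild($book);

// First call: the attribute does not exist yet, so it is created.
$book->setAttribute('isbn', 'isbn1234');
$afterCreate = $book->getAttribute('isbn');

// Second call: the attribute exists, so its value is updated.
$book->setAttribute('isbn', 'isbn9999');
$afterUpdate = $book->getAttribute('isbn');

// setAttributeNode() with an attribute of the same name replaces the old one.
$attr = $doc->createAttribute('isbn');
$attr->value = 'isbn0000';
$book->setAttributeNode($attr);
$afterReplace = $book->getAttribute('isbn');

// There is still only one isbn attribute on the element.
$attributeCount = $book->attributes->length;
```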

To supply the value for a text element, it is advisable to use a DOMCdataSection. The chapters of the books are declared as PCDATA and not CDATA in the DTD. This is because an element cannot be declared as containing CDATA directly; we have to declare it as PCDATA and then wrap the content in <![CDATA[...]]>. It sounds counter-intuitive, as we need to be able to put unparsed character data in the text element for use later, but this is why we have to create a specific DOMCdataSection; it safely wraps our text in <![CDATA[...]]>. If you were to add HTML to a node as ordinary text, you’ll find that special characters such as < or & are converted to their entities (i.e. &lt; and &amp;). This is because these characters have special meaning in XML: the ampersand starts an entity reference, and the less-than symbol starts a tag. DOM substitutes them so as not to cause any parsing issues when the document is loaded or validated.
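To compare the two approaches side by side, the sketch below stores the same HTML fragment once as an ordinary text node and once as a CDATA section, then serializes each element. The demo root element and the sample string are assumptions made for illustration and are not part of the library DTD.

```php
<?php
$doc = new DOMDocument();
$root = $doc->appendChild($doc->createElement('demo'));   // hypothetical wrapper element
$html = '<i>Sit & Dolor</i>';

// Ordinary text node: special characters are escaped on output.
$escaped = $root->appendChild($doc->createElement('text'));
$escaped->appendChild($doc->createTextNode($html));

// CDATA section: the content is written out verbatim inside <![CDATA[ ... ]]>.
$wrapped = $root->appendChild($doc->createElement('text'));
$wrapped->appendChild($doc->createCDATASection($html));

$escapedXml = $doc->saveXML($escaped);   // contains &lt; and &amp; entities
$wrappedXml = $doc->saveXML($wrapped);   // <text><![CDATA[<i>Sit & Dolor</i>]]></text>

// Either way, reading the value back gives the original string.
$roundTrip = [$escaped->nodeValue, $wrapped->nodeValue];
```

Both forms carry the same data; the CDATA version simply keeps the stored markup readable in the XML file.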

The last step in adding a book is to save the new document back into the file, which is done with the document’s save() method.

The method altogether looks like this:

<?php
public function addBook($isbn, $title, $author, $genre, $chapters)
{
    // create a new element representing the new book
    $newbook = $this->domDocument->createElement("book");
    // append the newly created element
    $this->domDocument->documentElement
        ->appendChild($newbook);

    // setting the attribute can be done in one of two ways
    // Method One:
    // $newbook->setAttribute("isbn", $isbn);
    // Method Two:
    $idAttribute = new DOMAttr("isbn", $isbn);
    $newbook->setAttributeNode($idAttribute);

    // note: createElement() does not escape its value argument,
    // so values containing & need createTextNode() instead
    $title = $this->domDocument
        ->createElement("title", $title);
    $newbook->appendChild($title);

    $author = $this->domDocument
        ->createElement("author", $author);
    $newbook->appendChild($author);

    $genre = $this->domDocument
        ->createElement("genre", $genre);
    $newbook->appendChild($genre);

    foreach($chapters as $position => $chapter) {
        $newchapter = $this->domDocument
            ->createElement("chapter");
        $newbook->appendChild($newchapter);
        $newchapter->setAttribute("position", $position);

        $newchaptitle = $this->domDocument
            ->createElement("chaptitle", $chapter["title"]);
        $newchapter->appendChild($newchaptitle);

        $newtext = $this->domDocument->createElement("text");
        $newchapter->appendChild($newtext);

        // Rather than creating a new element, create a
        // DOMCdataSection which ensures our text is
        // wrapped in <![CDATA[ and ]]>
        $cdata = new DOMCdataSection($chapter["text"]);
        $newtext->appendChild($cdata);
    }

    // save the document
    $this->domDocument->save($this->xmlPath);
}

Deleting a Book from the Library

The next method to tackle is deleting a book. This is just a case of identifying which element in the XML document you want to delete and then using the removeChild() method to remove it. There are two important things to understand, however.

First, removeChild() only removes a direct child of the node it is called on. The book elements are children of the library element, not of the DOMDocument itself, so you have to access the documentElement and remove the child from there. This is the same reason you had to refer to documentElement when adding a book to the library.

Second, removing the element from the document just removes it from memory. If you want to persist the data, you should save it back to a file.

Here’s what the deleteBook() method looks like:

<?php
public function deleteBook($isbn) {
    // get the book element based on its ID
    $book = $this->domDocument->getElementById($isbn);
    // if a book was not returned...
    if (!$book) {
        throw new Exception("No book found with ISBN " . $isbn);
    }
    // simply remove the child from the document's
    // documentElement
    $this->domDocument->documentElement->removeChild($book);
    // save back to disk
    $this->domDocument->save($this->xmlPath);
}
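As a variation on the removal step, a node can also be removed through its own parentNode, which works however deeply the node is nested. The sketch below uses an inline document without a DTD, so the book is fetched by position rather than by ID; that choice is only for illustration.

```php
<?php
$doc = new DOMDocument();
$doc->loadXML('<library><book isbn="isbn1234"/><book isbn="isbn1235"/></library>');

// No DTD is loaded here, so the first book is picked by position.
$book = $doc->getElementsByTagName('book')->item(0);

// Ask the node's parent to remove it; removeChild() returns the removed node.
$removed = $book->parentNode->removeChild($book);
$removedIsbn = $removed->getAttribute('isbn');

// Only the second book is left in the in-memory tree.
$remaining = $doc->getElementsByTagName('book');
$remainingCount = $remaining->length;
$remainingIsbn = $remaining->item(0)->getAttribute('isbn');
```

As with deleteBook(), the change exists only in memory until the document is saved.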

Find Books by Genre

The method to find specific books based on a genre employs XPath to obtain the results we need. getElementById(), as you saw before, is a convenient way of picking items out of the DOM when we have declared an ID within a DTD. But what can we do if we need to query against some other data in the XML? We can use a DOMXPath object. XPath itself is beyond the scope of this article, but I do advise you to look at some resources explaining the syntax. The XPath query to find any book item in the XML which has a genre of a specific type is:

//library/book/genre[text() = "some genre"]/..

This query says that we want a genre element found along the path library/book. The leading double slash tells XPath to search the whole document for library elements rather than only at the top level; since library is the root element here, a single leading slash (/library/book/...) would select the same nodes. The single slashes indicate that book is a child of library and genre is a child of book. [text() = "some genre"] indicates that we are looking for a genre element whose text is “some genre”. On its own, the result would just be the genre element, which is why /.. is added at the end to indicate that we actually need genre’s parent.

XPath is a great way to locate nodes in a structure. If you find yourself iterating over a few DOMNodeLists and testing nodeValues for certain values, then you’d probably be better off looking for an equivalent XPath query, which will usually be shorter and easier to read.

Here’s what the search method looks like:

<?php
public function findBooksByGenre($genre)
{
    // use XPath to find the books we're looking for
    // (assumes $genre contains no double quotes, which would break the query)
    $query = '//library/book/genre[text() = "' . $genre . '"]/..';
    // create a new XPath object and associate it with the document we want to query against
    $xpath = new DOMXPath($this->domDocument);
    $result = $xpath->query($query);

    $arrBooks = array();
    // iterate over the results
    foreach($result as $book)  {
        // add the title of the book to an array
        $arrBooks[] = $book->getElementsByTagName("title")->item(0)->nodeValue;
    }
    return $arrBooks;
}
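An equivalent query can put the genre test in a predicate on book, which selects the book elements directly and avoids the trailing /.. step. The sketch below runs that variant against an inline document; it assumes the genre string contains no double quote, since the value is concatenated into the expression.

```php
<?php
$doc = new DOMDocument();
$doc->loadXML(
    '<library>'
    . '<book isbn="isbn1234"><title>A Book</title><genre>Horror</genre></book>'
    . '<book isbn="isbn1235"><title>Another Book</title><genre>Science Fiction</genre></book>'
    . '</library>'
);

$xpath = new DOMXPath($doc);
$genre = 'Horror';   // assumed to contain no double quote

// Select every book whose genre child has the requested text.
$books = $xpath->query('/library/book[genre = "' . $genre . '"]');

$titles = [];
foreach ($books as $book) {
    // evaluate() with a context node returns the title as a plain string.
    $titles[] = $xpath->evaluate('string(title)', $book);
}
```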

Summary

This article was just a taster to show you how you can use DOM to manipulate and report back from XML data. PHP DOM is not as scary as it looks, and you may find that you prefer it over SimpleXML in certain circumstances.

One of the most important things you learned was the concept of the node, the basic building block of an XML document as far as DOM is concerned. You saw how to load an XML document into memory and validate it, pull data from it using getElementById() and getElementsByTagName(), add and remove elements, work with attributes, and use the DOMNodeList and DOMNamedNodeMap collections to pull out sets of data.

While a lot of things you saw today are things that you can probably do easily in SimpleXML already, I hope this article showed you how the same things can be achieved with DOM and what some of the benefits of DOM are.


Frequently Asked Questions (FAQs) about PHP DOM and Working with XML

What is the DOM in PHP and why is it important?

The Document Object Model (DOM) in PHP is a programming interface for HTML and XML documents. It represents the structure of a document and allows a programmer to manipulate the content, structure, and styles of a document. The DOM represents a document as a tree structure where each node is an object representing a part of the document. This model is crucial as it allows developers to create, navigate, and modify content dynamically.

How can I create a new DOMDocument in PHP?

Creating a new DOMDocument in PHP is quite straightforward. You simply need to instantiate a new instance of the DOMDocument class. Here’s a simple example:

$doc = new DOMDocument();

This will create a new DOMDocument object that you can then manipulate using various methods provided by the DOMDocument class.

How can I load XML into a DOMDocument?

You can load XML into a DOMDocument using the loadXML() method. This method parses the XML string into the existing DOMDocument object and returns true on success or false on failure. Here’s an example:

$doc = new DOMDocument();
$doc->loadXML($xmlString);

In this example, $xmlString is a string containing your XML content.

How can I add elements to a DOMDocument?

You can add elements to a DOMDocument using the createElement() method. This method creates a new instance of the class DOMElement, which you then attach to the tree with appendChild(). Here’s an example:

$doc = new DOMDocument();
$element = $doc->createElement('example', 'This is an example');
$doc->appendChild($element);

In this example, ‘example’ is the tag name and ‘This is an example’ is the tag content.

How can I remove elements from a DOMDocument?

You can remove elements from a DOMDocument using the removeChild() method. This method removes a child node from the DOM. Here’s an example:

$doc = new DOMDocument();
$element = $doc->createElement('example', 'This is an example');
$doc->appendChild($element);
$doc->removeChild($element);

In this example, the ‘example’ element is removed from the DOM.

How can I navigate through a DOMDocument?

You can navigate through a DOMDocument using various methods provided by the DOMDocument class. For example, you can use the getElementsByTagName() method to get all elements with a specific tag name. Here’s an example:

$doc = new DOMDocument();
$doc->loadXML($xmlString);
$elements = $doc->getElementsByTagName('example');

In this example, $elements is a DOMNodeList containing all ‘example’ elements in the DOM.

How can I modify the content of a DOMDocument?

You can modify the content of a DOMDocument using the nodeValue property. This property sets or returns the text content of a node and its descendants. Here’s an example:

$doc = new DOMDocument();
$doc->loadXML($xmlString);
$element = $doc->getElementsByTagName('example')->item(0);
$element->nodeValue = 'New content';

In this example, the content of the first ‘example’ element is changed to ‘New content’.

How can I save a DOMDocument as an XML file?

You can save a DOMDocument as an XML file using the save() method, which writes the document straight to a path. Alternatively, the saveXML() method returns the XML content of a DOMDocument or a node as a string, which you can then write out yourself. Here’s an example of the second approach:

$doc = new DOMDocument();
$doc->loadXML($xmlString);
$xmlContent = $doc->saveXML();
file_put_contents('example.xml', $xmlContent);

In this example, the XML content of the DOMDocument is saved as ‘example.xml’.

How can I handle errors when working with a DOMDocument?

You can handle errors when working with a DOMDocument by using the libxml_use_internal_errors() function. This function allows you to suppress errors and enable user error handling. Here’s an example:

libxml_use_internal_errors(true);
$doc = new DOMDocument();
if (!$doc->loadXML($xmlString)) {
    $errors = libxml_get_errors();
    foreach ($errors as $error) {
        // handle errors here
    }
    libxml_clear_errors();
}

In this example, if loading the XML fails, the errors are stored in the $errors array and can be handled as needed.

How can I validate XML against a DTD or schema using a DOMDocument?

You can validate XML against a DTD or schema using the validate() or schemaValidate() methods of the DOMDocument class. Here’s an example:

$doc = new DOMDocument();
$doc->loadXML($xmlString);
if (!$doc->schemaValidate('example.xsd')) {
    // handle validation errors here
}

In this example, the XML content of the DOMDocument is validated against the ‘example.xsd’ schema. If the validation fails, the errors can be handled as needed.
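For the DTD case, a minimal sketch of validate() might look like the following. It assumes a reduced DTD embedded in the document and a book that deliberately omits the required isbn attribute, so that the document is well-formed but not valid.

```php
<?php
// Collect libxml problems instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// Assumption: a cut-down inline DTD; the book is missing its required isbn.
$xml = <<<'XML'
<!DOCTYPE library [
  <!ELEMENT library (book*)>
  <!ELEMENT book (title)>
  <!ATTLIST book isbn ID #REQUIRED>
  <!ELEMENT title (#PCDATA)>
]>
<library><book><title>No ISBN</title></book></library>
XML;

$doc = new DOMDocument();
$loaded = $doc->loadXML($xml);   // succeeds: the XML is well-formed
$valid = $doc->validate();       // fails: the DTD requires an isbn attribute

// Gather whatever messages libxml recorded, then reset the error buffer.
$messages = [];
foreach (libxml_get_errors() as $error) {
    $messages[] = trim($error->message);
}
libxml_clear_errors();
```

Keeping loading and validation as separate steps makes it possible to tell a malformed file apart from one that merely breaks the DTD’s rules.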