Mangling XML as Text with PHP DOM

Recently I had to do some mass-conversion of HTML files to DITA XML — material I’d written for the upcoming JavaScript Ultimate Reference (the third, and arguably most complicated, part of the SitePoint Reference).

But a problem I came across several times was the sheer complexity of recursive element conversion: <code> becomes <jsvalue> (or one of a dozen similar elements), <a> becomes <xref>, and that’s all simple enough. But each of these elements might contain the other, or further child elements like <em>, and the deeper we walk into the DOM the more potential recursion there is, until it gets to the point where my brain explodes.

There’s a limit to how much recursion I can get my head around, or rather, a limit to how much I’m prepared to get my head around before I just say: to heck with this, why can’t I mangle it as text with regular expressions!?

Unfortunately there doesn’t seem to be a way with PHP DOM to get the text equivalent of any arbitrary node, but we can do that at the Document or DocumentFragment level; so with a little toying-around I came up with a way to leverage that capability and make it work at the Node level.

So for example, let’s begin with this XML:

<?xml version="1.0" encoding="utf-8"?>
<root id="introduction">
	<div class="section">
		The fundamental data type is <code>Node</code>
	</div>
</root>

We have a reference to its DOM saved to a PHP variable called $xmldom, and we want to process it so that the <code> element becomes a <jstype>, and the <div class="section"> becomes simply <section>, all without affecting the rest of the document.
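For reference, here is a minimal sketch of how $xmldom might be set up. Loading from a string with loadXML() is an assumption, since the article doesn’t show how the document was loaded. It also illustrates a detail that matters when picking out nodes: with default settings the whitespace between tags is kept as text nodes, so the first child of <root> is not the <div>.

```php
<?php
// Assumption: the XML is available as a string; load() would do the same for a file.
$xml = <<<XML
<?xml version="1.0" encoding="utf-8"?>
<root id="introduction">
	<div class="section">
		The fundamental data type is <code>Node</code>
	</div>
</root>
XML;

$xmldom = new DOMDocument();
$xmldom->loadXML($xml);

// With default settings, the whitespace before <div> is a text node.
$first = $xmldom->documentElement->firstChild;
var_dump($first->nodeType === XML_TEXT_NODE);

// Asking for the element by name sidesteps the whitespace nodes.
$div = $xmldom->getElementsByTagName('div')->item(0);
echo $div->getAttribute('class'), "\n";
```

Setting $xmldom->preserveWhiteSpace = false before loading is another way to avoid the whitespace nodes.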

Here’s the complete code to do it, which I’ll then talk through stage by stage:

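// Step 1: get a reference to the element we want to work with.
// (With default whitespace handling, documentElement->firstChild would be
// the whitespace text node before <div>, so we ask for the element by name.)
$node = $xmldom->getElementsByTagName('div')->item(0);

// Step 2: copy the node into a placeholder document and serialize it to text.
$doc = new DOMDocument();
$doc->loadXML('<xmltext/>');
$copy = $doc->importNode($node, true);
$doc->documentElement->appendChild($copy);
// preg_replace() stands in for ereg_replace(), which was removed in PHP 7.
$xmltext = preg_replace('~^.*<xmltext>(.*)</xmltext>.*$~s', '$1', $doc->saveXML());

// Step 3: mangle the text.
$xmltext = preg_replace('~<(/?)code>~', '<$1jstype>', $xmltext);
$xmltext = preg_replace('~<(/?)div[^>]*>~', '<$1section>', $xmltext);

// Step 4: convert the text back into XML.
$fragment = $xmldom->createDocumentFragment();
$fragment->appendXML($xmltext);

// Step 5: merge the result back into the original document.
$xmldom->documentElement->replaceChild($fragment, $node);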

In the first step we get a reference to the element we want to work with, and save it to $node.

In the second step we create a new document, and use loadXML() to create a placeholder root node (the loadXML method parses a string of XML into a DOM document, and is one of the cornerstones of our process). Next we import a copy of the original node into that document and append it to the placeholder root, then use saveXML() to convert the whole document to text (the saveXML method serializes an XML document to a string, and is as critical as loadXML() for what we’re doing here). The text output is then run through a regular-expression replacement that strips the outer contents of the document (its prolog and placeholder root node), so that we’re left with a text equivalent of the original input node.
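As an aside, saveXML() also accepts an optional node argument, which may be a shorter route to the same text. The sketch below is an alternative to the placeholder-document technique described here, not the method this article uses, and the sample input is condensed onto one line for brevity.

```php
<?php
$xmldom = new DOMDocument();
$xmldom->loadXML('<root id="introduction"><div class="section">The fundamental data type is <code>Node</code></div></root>');

$node = $xmldom->getElementsByTagName('div')->item(0);

// Passing a node to saveXML() serializes just that node and its children,
// with no XML declaration and no wrapper element to strip afterwards.
$xmltext = $xmldom->saveXML($node);

echo $xmltext, "\n";
```

Either way, the result is a plain string that can be handed to the regular expressions in the next step.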

In the third step we do whatever text-based mangling we need; in this case it’s simple element name conversions, but it could be anything.
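To give a feel for “it could be anything”, here is a small sketch of a slightly richer conversion, turning <a href="…"> into <xref href="…"> while keeping the attribute value. The input string and the $replacements array are hypothetical, made up for illustration.

```php
<?php
// Hypothetical input text, standing in for the $xmltext built in step two.
$xmltext = 'See <a href="node.html"><code>Node</code></a> for details.';

// Pattern => replacement pairs; preg_replace() applies them in order.
$replacements = [
    '~<(/?)code>~'           => '<$1jstype>',
    '~<a\s+href="([^"]*)">~' => '<xref href="$1">',
    '~</a>~'                 => '</xref>',
];

$xmltext = preg_replace(array_keys($replacements), array_values($replacements), $xmltext);

echo $xmltext, "\n";
```

Keeping the rules in one array makes it easy to add the dozen or so similar conversions without touching the rest of the code.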

In the fourth step we want to convert our parsed text back into XML, and we do this by creating a document fragment, then using appendXML() to load the text and have it converted to XML (the appendXML method does the same thing as loadXML(), but it doesn’t require an entire document to be created).
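One thing to watch in this step is that text mangling can easily produce markup that is no longer well-formed, in which case appendXML() returns false. A minimal sketch of checking for that follows; the broken input is hypothetical, and collecting the messages through libxml_use_internal_errors() is just one way to keep the warnings out of the output.

```php
<?php
$xmldom = new DOMDocument();
$xmldom->loadXML('<root/>');

// Collect libxml errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// Hypothetical mangled text with a mismatched closing tag.
$broken = '<section>The fundamental data type is <jstype>Node</code></section>';

$fragment = $xmldom->createDocumentFragment();
$ok = $fragment->appendXML($broken);

if ($ok === false) {
    // Report whatever the parser complained about, then reset the error list.
    foreach (libxml_get_errors() as $error) {
        echo trim($error->message), "\n";
    }
    libxml_clear_errors();
}
```

Checking the return value before merging means a bad regular expression leaves the original document untouched rather than half-converted.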

Finally, in the fifth step we merge the processed XML back into our original document. The document fragment has the original document as its owner, so we can simply use the replaceChild method to replace the original node and its children with the processed version. (Whenever a document fragment is added to a document, only its children are actually added and the fragment itself is left behind empty; DocumentFragment is a virtual construct and never actually appears in a document.)
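The fragment behaviour is easy to see in isolation. This sketch uses a made-up one-line document and a fragment holding two elements, and shows that both end up as children of the root while the fragment itself is emptied.

```php
<?php
$xmldom = new DOMDocument();
$xmldom->loadXML('<root><div>old</div></root>');

$fragment = $xmldom->createDocumentFragment();
$fragment->appendXML('<section>one</section><section>two</section>');

// The <div> is the first child here because this input has no whitespace.
$old = $xmldom->documentElement->firstChild;
$xmldom->documentElement->replaceChild($fragment, $old);

// Both <section> elements are now children of <root>.
$result = $xmldom->saveXML($xmldom->documentElement);
echo $result, "\n";

// The fragment has handed over its children and is left empty.
var_dump($fragment->hasChildNodes());
```

This is also why one node can be replaced by several: the text mangling is free to split an element into siblings.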

Both the first and the final step are arbitrary: we could work with an entire document, or just a single node, and edit our referencing and merging statements accordingly. Or we could build a function from the inner steps, which accepts $node as an argument (and maybe an array of replacement expressions), and returns the processed node at the end:

function mangleXML($node)
{
	...
	
	return $node;
}
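One possible way to fill in that outline is sketched below. The $replacements parameter (pattern => replacement pairs), the decision to do the merge inside the function, and the choice to return the original node when the mangled text is not well-formed are all assumptions made for illustration.

```php
<?php
/**
 * Sketch of a mangleXML() helper: serialize a node to text, apply regex
 * replacements, and swap the re-parsed result in for the original node.
 * The $replacements parameter is an assumption, not part of the outline.
 */
function mangleXML(DOMNode $node, array $replacements)
{
    $owner = $node->ownerDocument;

    // Copy the node into a throwaway document and serialize it.
    $doc = new DOMDocument();
    $doc->loadXML('<xmltext/>');
    $doc->documentElement->appendChild($doc->importNode($node, true));
    $xmltext = preg_replace('~^.*<xmltext>(.*)</xmltext>.*$~s', '$1', $doc->saveXML());

    // Text-based mangling.
    $xmltext = preg_replace(array_keys($replacements), array_values($replacements), $xmltext);

    // Back to XML, as a fragment owned by the original document.
    $fragment = $owner->createDocumentFragment();
    if ($fragment->appendXML($xmltext) === false) {
        return $node; // leave the document untouched if the text is not well-formed
    }

    // Remember the new node before replaceChild() empties the fragment.
    $processed = $fragment->firstChild;
    $node->parentNode->replaceChild($fragment, $node);

    return $processed;
}

$xmldom = new DOMDocument();
$xmldom->loadXML('<root id="introduction"><div class="section">The fundamental data type is <code>Node</code></div></root>');

$processed = mangleXML($xmldom->getElementsByTagName('div')->item(0), [
    '~<(/?)code>~'     => '<$1jstype>',
    '~<(/?)div[^>]*>~' => '<$1section>',
]);

$output = $xmldom->saveXML($xmldom->documentElement);
echo $output, "\n";
```

Using $node->parentNode for the merge, rather than documentElement, lets the same function work on a node at any depth.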