Mangling XML as Text with PHP DOM


by James Edwards

Recently I had to do some mass-conversion of HTML files to DITA XML — material I’d written for the upcoming JavaScript Ultimate Reference (the third, and arguably most complicated, part of the SitePoint Reference).

But a problem I ran into several times was the sheer complexity of recursive element conversion: <code> becomes <jsvalue> (or one of a dozen similar elements), <a> becomes <xref>, and that’s all simple enough. But each of these elements might contain the other, or further child elements like <em>, and the deeper we walk into the DOM the more potential recursion there is, until it gets to the point where my brain explodes.

There’s a limit to how much recursion I can get my head around — or rather — a limit to how much I’m prepared to get my head around before I just go the heck with this, why can’t I mangle it as text with regular expressions!?

Unfortunately there doesn’t seem to be a way with PHP DOM to get the text equivalent of any arbitrary node, but we can do that at the Document or DocumentFragment level; so with a little toying-around I came up with a way to leverage that capability and make it work at the Node level.

So for example, let’s begin with this XML:

<?xml version="1.0" encoding="utf-8"?>
<root id="introduction">
	<div class="section">
		The fundamental data type is <code>Node</code>
	</div>
</root>

We have a reference to its DOM saved to a PHP variable called $xmldom. And we want to parse it so that the <code> element becomes a <jstype>, and the <div class="section"> becomes simply <section>, all without affecting the rest of the document.

Here’s the complete code to do it, which I’ll then talk through stage by stage:

// Step 1: get the node to work on. This assumes $xmldom was loaded with
// preserveWhiteSpace = false; otherwise firstChild is the whitespace text
// node that precedes the <div>.
$node = $xmldom->documentElement->firstChild;

// Step 2: copy the node into a placeholder document and serialize it
$doc = new DOMDocument();
$doc->loadXML('<xmltext/>');
$node = $doc->importNode($node, true);
$doc->documentElement->appendChild($node);
$xmltext = preg_replace('/^.*<xmltext>(.*)<\/xmltext>.*$/s', '$1', $doc->saveXML());

// Step 3: mangle the markup as text
$xmltext = preg_replace('/<(\/?)code>/', '<$1jstype>', $xmltext);
$xmltext = preg_replace('/<(\/?)div[^>]*>/', '<$1section>', $xmltext);

// Step 4: parse the text back into nodes owned by the original document
$node = $xmldom->createDocumentFragment();
$node->appendXML($xmltext);

// Step 5: swap the processed nodes in for the original
$xmldom->documentElement->replaceChild($node, $xmldom->documentElement->firstChild);

In the first step we get a reference to the element we want to work with, and save it to $node.

In the second step we create a new document, and use loadXML() to create a placeholder root node (the loadXML method parses a string of XML into a DOM, and is one of the cornerstones of our process). Next we import the original node into that document and append it to the placeholder root, then use saveXML() to convert the whole document to text (the saveXML method serializes a document to a string, and is as critical as loadXML() for what we’re doing here). The text output is passed through preg_replace (with the s modifier, so that . also matches newlines) to strip the outer contents of the document (its prolog and root node), so that we’re left with a text equivalent of the original input node.
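
To try the first two steps in isolation, here is a minimal sketch that builds its own input. The one-line sample string and the variable names $serialized and $xmltext are assumptions made for illustration; the rest follows the approach described above.

```php
<?php
// Sample input (an assumption for this sketch), written on one line so that
// the <div> really is the first child of the root element.
$xmldom = new DOMDocument();
$xmldom->loadXML('<root id="introduction"><div class="section">The fundamental data type is <code>Node</code></div></root>');

// Step 1: the node we want as text
$node = $xmldom->documentElement->firstChild;

// Step 2: copy it into a placeholder document and serialize that document
$doc = new DOMDocument();
$doc->loadXML('<xmltext/>');
$doc->documentElement->appendChild($doc->importNode($node, true));
$serialized = $doc->saveXML(); // prolog, then <xmltext>...</xmltext>

// Strip the prolog and the placeholder root; the s modifier lets . match newlines
$xmltext = preg_replace('/^.*<xmltext>(.*)<\/xmltext>.*$/s', '$1', $serialized);

echo $xmltext, "\n";
```

The original node is left untouched in $xmldom, because importNode() with a second argument of true makes a deep copy rather than moving it.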

In the third step we do whatever text-based mangling we need; in this case it’s simple element name conversions, but it could be anything.

In the fourth step we want to convert our parsed text back into XML, and we do this by creating a document fragment, then using appendXML() to load the text and have it converted to XML (the appendXML method does the same thing as loadXML(), but it doesn’t require an entire document to be created).
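
As a rough illustration of that fourth step on its own, the sketch below parses a string into a fragment and inspects the result. The sample markup and the $fragment and $ok names are assumptions, not part of the original code.

```php
<?php
// A bare document to own the fragment (sample data for this sketch)
$xmldom = new DOMDocument();
$xmldom->loadXML('<root/>');

// appendXML() parses the string into child nodes of the fragment and
// returns a boolean indicating success
$fragment = $xmldom->createDocumentFragment();
$ok = $fragment->appendXML('<section>The fundamental data type is <jstype>Node</jstype></section>');

var_dump($ok);
echo $fragment->firstChild->nodeName, "\n";            // the <section> element
echo $fragment->firstChild->lastChild->nodeName, "\n"; // the nested <jstype> element
```

Because the fragment is created by $xmldom, its nodes already belong to that document and need no further importing.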

Finally, in the fifth step we merge the processed XML back into our original document. The document fragment has the original document as its owner, so we can simply use the replaceChild method to replace the original node and its children with the processed version. (Whenever a document fragment is added to a document, only its children are actually added and the fragment itself is discarded; DocumentFragment is a virtual construct and never actually appears in a document.)
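
The fragment behaviour described here can be seen in a small sketch. The sample markup, including the extra <note> element, is an assumption chosen to show that several top-level nodes are inserted at once.

```php
<?php
// Sample document with one child to replace (an assumption for this sketch)
$xmldom = new DOMDocument();
$xmldom->loadXML('<root><div>old</div></root>');

// A fragment holding two top-level nodes
$fragment = $xmldom->createDocumentFragment();
$fragment->appendXML('<section>new</section><note>extra</note>');

// Replace the <div> with the fragment: only the fragment's children are inserted
$root = $xmldom->documentElement;
$root->replaceChild($fragment, $root->firstChild);

// Passing a node to saveXML() serializes just that node, without the prolog
$result = $xmldom->saveXML($root);
echo $result, "\n"; // <root><section>new</section><note>extra</note></root>

// The fragment has handed its children over and is now empty
var_dump($fragment->hasChildNodes());
```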

Both the first and the final step are arbitrary — we could work with an entire document, or just a single node, and edit our referencing and merging statements accordingly. Or we could build a method from the inner steps, which accepts $node as an argument (and maybe an array of replacement expressions), and returns the processed node at the end:

function mangleXML($node)
{
	...
	
	return $node;
}
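
One possible way to flesh that out is sketched below. The second parameter $replacements (an array mapping regex patterns to replacement strings), the decision to return a document fragment, and the sample input are all assumptions of this sketch rather than the finished method.

```php
<?php
/**
 * Hypothetical helper: serializes $node to text, applies each regular
 * expression replacement, and returns the result as a DocumentFragment
 * owned by the node's document. Assumes $node is attached to a document
 * and that the replacements leave the markup well-formed.
 */
function mangleXML(DOMNode $node, array $replacements)
{
    // Serialize a copy of the node via a placeholder document
    $doc = new DOMDocument();
    $doc->loadXML('<xmltext/>');
    $doc->documentElement->appendChild($doc->importNode($node, true));
    $xmltext = preg_replace('/^.*<xmltext>(.*)<\/xmltext>.*$/s', '$1', $doc->saveXML());

    // Apply each pattern => replacement pair in order
    foreach ($replacements as $pattern => $replacement) {
        $xmltext = preg_replace($pattern, $replacement, $xmltext);
    }

    // Parse the text back into nodes owned by the original document
    $fragment = $node->ownerDocument->createDocumentFragment();
    $fragment->appendXML($xmltext);

    return $fragment;
}

// Usage with sample input (one line, so firstChild is the <div>)
$xmldom = new DOMDocument();
$xmldom->loadXML('<root id="introduction"><div class="section">The fundamental data type is <code>Node</code></div></root>');

$old = $xmldom->documentElement->firstChild;
$new = mangleXML($old, array(
    '/<(\/?)code>/'     => '<$1jstype>',
    '/<(\/?)div[^>]*>/' => '<$1section>',
));
$xmldom->documentElement->replaceChild($new, $old);

$output = $xmldom->saveXML($xmldom->documentElement);
echo $output, "\n";
```

Returning the fragment rather than replacing the node inside the function keeps the referencing and merging steps in the caller's hands.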

This post has 11 responses so far

  1. There’s a limit to how much recursion I can get my head around — or rather — a limit to how much I’m prepared to get my head around before I just go the heck with this, why can’t I mangle it as text with regular expressions!?

    If the replacement doesn’t depend on the context in the tree then a linear pass for each token is much simpler.

    You realise that the point of recursion is that you only ever need to get your head round 2 levels right? Which is kind of where you’re going with the very bottom code skeleton.

     
  2. Should be able to use $xmltext = $xmldom->saveXML($node);.

    Now why that’s a document method rather than a node method, I’ve no idea.

     
  3. Are you some sort of idiot?

    Have you ever heard of XSL? Are you even aware of the fact that ereg() is deprecated?

    Seriously, WTF?

     
  4. Hi!

    Seems to be the perfect use case for XSLT instead of PHP and DOM. Any reason why you chose PHP?

    Gabriel

     
  5. What was the reason not to use XSLT?

     
  6. I just want to paste a small “review” of your article, which I found on another blog ( http://blog.wombert.de/post/43374548/sitepoint-blogs-mangling-xml-as-text-with-php-dom ).

    “say that this article is the top contestant for the biggest fail of the year… he uses ereg functions (they always sucked, and will be gone in PHP6), parses and replaces XML by hand, has apparently never heard of XPath and most of all… XSL was designed to do all this, like… decades ago, but hey, it’s only 2008…”

    I agree 100% with this opinion.

     
  7. I love regular expressions as much (more than?) the next guy, but it seems uncomfortably brittle to use them to “parse” XML. If XSLT is overkill for this sort of XML transformation, why not then instead use XPath to identify the nodes you want to change and a little bit of DOM manipulation to effect the change? E.g.:

    $d = new DOMDocument();
    $d->loadXML($xml);
    $x = new DOMXPath($d);
    $replacements = array(
        '//div[@class="section"]' => 'section',
        '//code' => 'jstype'
    );
    foreach ($replacements as $query => $newName) {
        foreach ($x->query($query) as $oldNode) {
            // Create the renamed element and copy the old node's children into it
            $newNode = $oldNode->ownerDocument->createElement($newName);
            foreach ($oldNode->childNodes as $child) {
                $newNode->appendChild($child->cloneNode(true));
            }
            $oldNode->parentNode->replaceChild($newNode, $oldNode);
        }
    }
     
  8. (Ugh. Looks like the “code block” choice in the editor here doesn’t preserve newlines.)

     
  9. Unfortunately there doesn’t seem to be a way with PHP DOM to get the text equivalent of any arbitrary node

    Ummmh, Xpath?

    Seriously, PHP’s XSLT support in 5 is great; you can even call PHP functions from within the XSLT.

     
  10. 1) http://us.php.net/manual/en/class.domnode.php#domnode.props.textcontent

    2) http://svn.assembla.com/svn/php_domquery/trunk/DomQuery.php

     
  11. I wish I had read this post about a week earlier.

     
