By Albert Wiersch

Have You Considered Polyglot Markup?

Many web developers have moved from XHTML to HTML5. But did you know that a web document can be both HTML and XML at the same time? This is what’s known as polyglot markup: HTML documents that can correctly be served either as text/html or with an XML MIME type, such as application/xml or application/xhtml+xml. Like a polyglot person, who speaks more than one language, a polyglot document “speaks” both HTML and XML.

Note, however, that the aim is to conform to HTML5 and to be well-formed XML; polyglot documents do not have to be valid XHTML.

The Constraints of Going Polyglot

Creating polyglot documents enforces more constraints and structure because they must conform to XML rules for well-formedness. For example, HTML element names and attribute names must typically be in lowercase, and all elements must have an end tag or use the minimized tag syntax (like <br/>). The trick is to make sure that the document parses into identical document trees (though there are some exceptions), whether it is processed by an HTML parser or by an XML parser.
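As a rough illustration of that idea, here is a minimal Python sketch that feeds one small polyglot document to both an XML parser and an HTML tokenizer and compares the element names each one reports. The sample document and the helper names (TagCollector, html_element_names, xml_element_names) are my own assumptions for illustration. Note also that Python's html.parser is only a tokenizer, not a full HTML5 tree builder, so this approximates the comparison rather than reproducing what a browser does.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sample document written to follow the polyglot constraints.
POLYGLOT_DOC = """<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
<meta charset="UTF-8"/>
<title>Polyglot example</title>
</head>
<body>
<p>First line<br/>second line</p>
</body>
</html>"""


class TagCollector(HTMLParser):
    """Records element names in the order an HTML tokenizer sees them."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Also called for minimized tags such as <br/>.
        self.tags.append(tag)


def html_element_names(source):
    collector = TagCollector()
    collector.feed(source)
    collector.close()
    return collector.tags


def xml_element_names(source):
    # Raises ET.ParseError if the source is not well-formed XML.
    root = ET.fromstring(source)
    # ElementTree reports tags as "{namespace}name"; keep only the name.
    return [element.tag.rsplit("}", 1)[-1] for element in root.iter()]


if __name__ == "__main__":
    print(html_element_names(POLYGLOT_DOC))
    print(xml_element_names(POLYGLOT_DOC))
```

Both functions report the same elements in the same order for this document. If you change <br/> to <br>, the HTML side still works, but the XML parser raises a ParseError, which is exactly the kind of divergence the polyglot constraints are meant to rule out.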

By doing this, your documents will almost certainly be better structured and of higher quality, yet they can still be treated as HTML5. Another benefit is that they can be processed by XML tools. Also, if you need HTML and XHTML versions of a page, you won’t need to maintain two different copies of the content (which is almost always a bad idea). You can serve a polyglot document as HTML or as XHTML, whichever you need, without changing any content in the document.

Going back to the idea of identical DOMs (Document Object Models), remember that this is critical because browsers don’t render HTML directly. Instead, they create a DOM from the document source and render that. It’s also the DOM that is manipulated by CSS and JavaScript. If different DOMs are created, then documents might render differently, especially if CSS or JavaScript then acts on those DOMs in inconsistent ways. Therefore, many of the guidelines exist to make sure that identical DOMs are produced regardless of which parser is used to parse the polyglot document.

The W3C Recommendations

The W3C has a working draft called Polyglot Markup: HTML-Compatible XHTML Documents, which details design guidelines for polyglot documents. I’ve summarized some of these guidelines below. For more detail, you can review the W3C document, which is actually not that long and not that difficult to read. Here’s the summary:

  • Do not use document.write() or document.writeln() because these may not be used in XML. Use the innerHTML property instead.
  • Do not use the noscript element because it cannot be used in XML documents.
  • Do not use XML processing instructions or an XML declaration.
  • Use UTF-8 encoding, and declare it in one of the ways listed in the W3C document. I recommend using <meta charset="UTF-8"/>.
  • Use an acceptable DOCTYPE, like <!DOCTYPE html>. Do not use DOCTYPE declarations for HTML4 or previous versions of HTML.
  • To maintain XML compatibility, explicitly declare the default namespaces for “html”, “math”, and “svg” elements, like <html xmlns="http://www.w3.org/1999/xhtml">.
  • If using any attributes in the XLink namespace, then declare the namespace on the html element or once on the foreign element where it is used.
  • Every polyglot document must have at least these elements (they cannot be left out): html, head, title, and body.
  • Every tr element must be explicitly wrapped in a tbody, thead or tfoot element to keep the HTML and XML DOMs consistent.
  • Every col element in a table element must be explicitly wrapped in a colgroup element. Again, this is to keep the HTML and XML DOMs consistent.
  • Use the correct case for element names. Only lowercase letters may be used for HTML and MathML element names, though some SVG elements must use only lowercase and some must use mixed case.
  • Use the correct case for attribute names. Only lowercase letters may be used for HTML and MathML attribute names, with the exception of definitionURL. Some SVG attribute names must use only lowercase and some must use mixed case.
  • Maintain case consistency on attribute values. An easy way to do this is to only use lowercase, but this is not required.
  • Only certain elements can be void. These elements must use the minimized tag syntax like <br/> (no end tags allowed). Some of these void elements are: area, br, embed, hr, img, input, link, and meta.
  • If the HTTP Content-Language header specifies exactly one language tag, specify the language using both the lang and xml:lang attributes on the html element.
  • Do not begin the text inside of a textarea or pre element with a newline.
  • All attribute values must be surrounded by either single or double quotation marks.
  • Do not use newline characters within an attribute value.
  • Do not use the xml:space or xml:base attributes, except in foreign content like MathML and SVG. These attributes are not valid in documents served as text/html.
  • When specifying a language, use both the lang and xml:lang attributes. Do not use one attribute without the other, and both must have identical values.
  • Use only the following named entity references: amp, lt, gt, apos, quot. For others, use the decimal or hexadecimal values instead of named entities.
  • Always use character references for the less-than sign and the ampersand, except when used in a CDATA section.
  • Whenever possible (though not required), script and style elements should link to external files rather than including them inline (this is good advice even for non-polyglot documents). However, when inline content is used, it should be “safe content” that does not contain any problematic less-than or ampersand characters (escaping them is not an option due to the creation of different DOMs). I also recommend wrapping inline script content in a CDATA section, with the CDATA markers commented out (use //<![CDATA[ as the first line before the script and //]]> as the last line, using “//” to comment out the CDATA markers). But again, you can avoid these issues by using external files rather than inline content.
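To show why the commented-out CDATA markers from the last guideline matter, here is a small Python sketch that runs the same inline script through an XML parser and an HTML tokenizer. The page template, the script body, and the helper names are hypothetical, chosen only for illustration, and html.parser again stands in for a real HTML5 parser. The two parsers do not see byte-identical script text, so treat this as a demonstration of the mechanism, not as proof that the document meets every polyglot guideline.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Hypothetical page; SCRIPT_GOES_HERE is replaced with the inline script.
PAGE_TEMPLATE = """<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="UTF-8"/>
<title>Inline script example</title>
<script>
SCRIPT_GOES_HERE
</script>
</head>
<body><p>Hello</p></body>
</html>"""

# A script body with characters that are a problem for XML parsers.
SCRIPT_BODY = 'if (1 < 2 && 3 > 2) { document.title = "ok"; }'
# The same body wrapped in CDATA markers hidden behind "//" comments.
WRAPPED_BODY = "//<![CDATA[\n" + SCRIPT_BODY + "\n//]]>"


class ScriptText(HTMLParser):
    """Collects the raw text an HTML tokenizer sees inside script elements."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if self.in_script:
            self.chunks.append(data)


def build_page(script_text):
    return PAGE_TEMPLATE.replace("SCRIPT_GOES_HERE", script_text)


def script_seen_by_html(source):
    parser = ScriptText()
    parser.feed(source)
    parser.close()
    return "".join(parser.chunks)


def script_seen_by_xml(source):
    # Raises ET.ParseError if "<" or "&" appear unprotected in the script.
    root = ET.fromstring(source)
    return root.find(f".//{XHTML_NS}script").text
```

With the wrapped script, the HTML side sees the markers as part of two JavaScript comment lines, while the XML side consumes the markers and is left with two bare // comment lines around the same code. Without the markers, script_seen_by_xml raises a ParseError on the unescaped less-than sign.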

Because of the many guidelines above, it is highly recommended that a polyglot-capable checker be used. Of course, you are already validating your current HTML and/or XHTML documents, right? If not, then I recommend that you start doing so immediately because there are many advantages to having higher-quality HTML. For a polyglot checker, you can try the new polyglot checking in the upcoming v12 release of CSE HTML Validator (for Windows). Download and install the free public beta and set the “Validate HTML documents as” option to Polyglot in the DOCTYPE Control options page of the Validator Engine Options. You can use the fully functional public beta for free until January 31, 2013.
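To give a feel for what such a checker does, here is a minimal Python sketch that tests a handful of the guidelines above: well-formedness, the required elements, the tbody and colgroup wrappers, and matching lang and xml:lang attributes. It is an assumption-laden toy, not how CSE HTML Validator or any other real checker works. The function name check_polyglot_structure and the problem messages are my own, and most of the guidelines are not covered.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
ROW_WRAPPERS = {"tbody", "thead", "tfoot"}


def local_name(element):
    """Strip the XHTML namespace that ElementTree prepends to tag names."""
    return element.tag.replace(XHTML_NS, "")


def check_polyglot_structure(source):
    """Return a list of problems; an empty list means these checks passed."""
    try:
        root = ET.fromstring(source)
    except ET.ParseError as error:
        return [f"not well-formed XML: {error}"]

    problems = []

    # The root must be html in the XHTML namespace, with the required children.
    if root.tag != XHTML_NS + "html":
        problems.append("root must be an html element in the XHTML namespace")
    for name in ("head", "title", "body"):
        if root.find(f".//{XHTML_NS}{name}") is None:
            problems.append(f"missing required element: {name}")

    for parent in root.iter():
        # tr and col need explicit wrappers so both parsers build the same tree.
        for child in parent:
            if local_name(child) == "tr" and local_name(parent) not in ROW_WRAPPERS:
                problems.append("tr is not wrapped in tbody, thead, or tfoot")
            if local_name(child) == "col" and local_name(parent) != "colgroup":
                problems.append("col is not wrapped in colgroup")

        # lang and xml:lang must appear together with identical values.
        lang = parent.get("lang")
        xml_lang = parent.get(XML_LANG)
        if (lang is not None or xml_lang is not None) and lang != xml_lang:
            problems.append(
                f"lang and xml:lang differ or one is missing on {local_name(parent)}"
            )

    return problems
```

Pass the document source as a string. A page that puts a tr directly inside a table, or that sets lang without xml:lang, comes back with one message per violation. A real checker also has to look at things an XML parser hides, such as quoting style, entity names, and newlines at the start of pre elements.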

If your documents are already in XHTML, then converting them to polyglot should be easy. Converting HTML documents may be more work, but you may find that the extra structure and quality requirements are a breath of fresh air compared to the sloppiness that is allowed by HTML5. Better structured documents are also easier to read and maintain, and less likely to be misinterpreted by browsers and search engines that are crawling your site.
