Have You Considered Polyglot Markup?

By Albert Wiersch

Nowadays, many web developers have moved from XHTML toward HTML5. But did you know that web documents can be both HTML and XML at the same time? This is what’s known as polyglot markup: HTML documents that can correctly be served either as text/html or as an XML MIME type, such as application/xml or application/xhtml+xml. Like a polyglot person, who speaks more than one language, a polyglot document “speaks” both HTML and XML.

However, note that the aim is to conform to both HTML5 and XML well-formedness; polyglot documents do not have to be valid XHTML.

The Constraints of Going Polyglot

Creating polyglot documents enforces more constraints and structure because they must conform to XML rules for well-formedness. For example, HTML element names and attribute names must typically be in lowercase, and all elements must have an end tag or use the minimized tag syntax (like <br/>). The trick is to make sure that the document parses into identical document trees (though there are some exceptions), whether it is processed by an HTML parser or by an XML parser.
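For example, here is a small illustrative snippet (the class name and content are hypothetical) written the polyglot way:

```html
<!-- Element and attribute names in lowercase, attribute values
     quoted, and void elements using the minimized tag syntax -->
<p class="intro">First line<br/>Second line</p>

<!-- Non-void elements always get an explicit end tag, even where
     an HTML parser would tolerate omitting it -->
<ul>
  <li>One item</li>
  <li>Another item</li>
</ul>
```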

By doing this, your documents will almost assuredly be better structured and of higher quality, yet still be able to be treated as HTML5. Another benefit is that they can be processed by XML tools. Also, if you need HTML and XHTML versions of a page, then you won’t need to maintain two different copies of content (which is almost always a bad idea). With a polyglot document, you can serve it as HTML when you need to or as XHTML when you need to, without changing any content in the document.

Going back to the idea of identical DOMs (Document Object Models), remember that this is critical because browsers don’t render HTML directly. Instead they create a DOM from the document source and render that. Also, it’s the DOM that is manipulated by CSS and JavaScript. If different DOMs are created, then documents might render differently, especially if using CSS or JavaScript causes inconsistent changes to the different DOMs. Therefore, many of the guidelines are for the purpose of making sure that identical DOMs are maintained regardless of the parser being used to parse the polyglot document.
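One well-known case where the two parsers build different trees is a table row without an explicit row group. A sketch of the problem and its polyglot fix:

```html
<!-- An HTML parser wraps the bare <tr> in an implied <tbody>;
     an XML parser does not, so the two DOMs differ (and a CSS
     selector like "table > tbody > tr" matches in only one of them) -->
<table>
  <tr><td>cell</td></tr>
</table>

<!-- Declaring the <tbody> explicitly keeps both DOMs identical -->
<table>
  <tbody>
    <tr><td>cell</td></tr>
  </tbody>
</table>
```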

The W3C Recommendations

The W3C has a working draft called Polyglot Markup: HTML-Compatible XHTML Documents, which details design guidelines for polyglot documents. I’ve summarized some of these guidelines below. For more detail, you can review the W3C document, which is actually not that long and not that difficult to read. Here’s the summary:

  • Do not use document.write() or document.writeln() because these may not be used in XML. Use the innerHTML property instead.
  • Do not use the noscript element because it cannot be used in XML documents.
  • Do not use XML processing instructions or an XML declaration.
  • Use UTF-8 encoding, and declare it in one of the ways listed in the W3C document. I recommend using <meta charset="UTF-8"/>.
  • Use an acceptable DOCTYPE, like <!DOCTYPE html>. Do not use DOCTYPE declarations for HTML4 or previous versions of HTML.
  • To maintain XML compatibility, explicitly declare the default namespaces for the html, math, and svg elements, like <html xmlns="http://www.w3.org/1999/xhtml">.
  • If using any attributes in the XLink namespace, then declare the namespace on the html element or once on the foreign element where it is used.
  • Every polyglot document must have at least these elements (they cannot be left out): html, head, title, and body.
  • Every tr element must be explicitly wrapped in a tbody, thead or tfoot element to keep the HTML and XML DOMs consistent.
  • Every col element in a table element must be explicitly wrapped in a colgroup element. Again, this is to keep the HTML and XML DOMs consistent.
  • Use the correct case for element names. Only lowercase letters may be used for HTML and MathML element names, though some SVG elements must use only lowercase and some must use mixed case.
  • Use the correct case for attribute names. Only lowercase letters may be used for HTML and MathML attribute names, with the exception of definitionURL. Some SVG attribute names must use only lowercase and some must use mixed case.
  • Maintain case consistency on attribute values. An easy way to do this is to only use lowercase, but this is not required.
  • Only certain elements can be void. These elements must use the minimized tag syntax like <br/> (no end tags allowed). Some of these void elements are: area, br, embed, hr, img, input, link, and meta.
  • If the HTTP Content-Language header specifies exactly one language tag, specify the language using both the lang and xml:lang attributes on the html element.
  • Do not begin the text inside of a textarea or pre element with a newline.
  • All attribute values must be surrounded by either single or double quotation marks.
  • Do not use newline characters within an attribute value.
  • Do not use the xml:space or xml:base attributes, except in foreign content like MathML and SVG. These attributes are not valid in documents served as text/html.
  • When specifying a language, use both the lang and xml:lang attributes. Do not use one attribute without the other, and both must have identical values.
  • Use only the following named entity references: amp, lt, gt, apos, quot. For others, use the decimal or hexadecimal values instead of named entities.
  • Always use character references for the less-than sign and the ampersand, except when used in a CDATA section.
  • Whenever possible (though not required), script and style elements should link to external files rather than including them inline (this is good advice even for non-polyglot documents). However, when inline content is used, it should be “safe content” that does not contain any problematic less-than or ampersand characters (escaping them is not an option due to the creation of different DOMs). I also recommend wrapping inline script content in a CDATA section, with the CDATA markers commented out (use //<![CDATA[ as the first line before the script and //]]> as the last line, using “//” to comment out the CDATA markers). But again, you can avoid these issues by using external files rather than inline content.
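Putting several of these guidelines together, a minimal polyglot skeleton might look like this (the title, text, and file names are placeholders):

```html
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
  <meta charset="UTF-8"/>
  <title>Example polyglot page</title>
  <link rel="stylesheet" href="style.css"/>
</head>
<body>
  <p>Content that parses identically as HTML and as XML.</p>
  <!-- External scripts are preferred; if a script must be inline,
       wrap it in a commented-out CDATA section -->
  <script src="script.js"></script>
  <script>
  //<![CDATA[
  var safe = 1 < 2; // raw "<" and "&" are safe inside the CDATA section
  //]]>
  </script>
</body>
</html>
```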

Because of the many guidelines above, it is highly recommended that a polyglot-capable checker be used. Of course, you are already validating your current HTML and/or XHTML documents, right? If not, then I recommend that you start doing so immediately because there are many advantages to having higher-quality HTML. For a polyglot checker, you can try the new polyglot checking in the upcoming v12 release of CSE HTML Validator (for Windows). Download and install the free public beta and set the “Validate HTML documents as” option to Polyglot in the DOCTYPE Control options page of the Validator Engine Options. You can use the fully functional public beta for free until January 31, 2013.

If your documents are already in XHTML, then converting them to polyglot should be easy. Converting HTML documents may be more work, but you may find that the extra structure and quality requirements are a breath of fresh air compared to the sloppiness that is allowed by HTML5. Better structured documents are also easier to read and maintain, and less likely to be misinterpreted by browsers and search engines that are crawling your site.

  • Patrick

    What exactly is the point of this? Almost every website uses a text/html MIME type, even when serving XHTML (which, ironically, makes the use of XHTML kinda pointless). Internet Explorer doesn’t play nicely with application/xhtml+xml, making it unsuitable for public websites.

    “Also, if you need HTML and XHTML versions of a page”

    I can’t even imagine a scenario where this would be necessary. So far the benefits of doing this seem highly questionable.

    On the other hand, it’s suggested that we don’t use things like the noscript tag, which is extremely helpful. We’re making big sacrifices.

    I guess my question is, what real-world problem does this solve?

    • If the noscript tag is so helpful, even in XHTML – eventually for being able to transition between XHTML5 and HTML5, then feel free to file a bug against the HTML5 specification, which is the one that forbids it.

      • To follow up on myself: What I mean is that the forbiddance of noscript in XHTML is “just” a “human” law. It is not a law dictated by God himself — XML. So you could include noscript in XHTML. It is just that the noscript magic would not happen in an XML parser, meaning that you would have to do some CSS and/or scripting tricks, instead, to imitate how it works in HTML.

        From a polyglot markup point of view, whether to allow noscript would depend, on the simplest level, on whether the HTML5 spec allows it in XHTML5. On a more philosophical level, it depends on whether it is defensible to allow it in polyglot markup.

        That said, I am not sure the noscript element is actually that useful.

        * It has been a little bit useful for supporting legacy versions of Internet Explorer. But e.g. Google is now phasing out support for legacy IE – by the end of this year, they won’t even support IE8.
        * Also, when implementing proposed new elements, some advocate its use. But I am not convinced it is actually needed.

        Would be interesting to hear when you consider it useful/necessary.

  • Hi Patrick,

    Some reasons you may want to use polyglot documents:

    * You want better structure & consistency. HTML can be “messy” because much is allowed. HTML allows both uppercase and lowercase and markup does not have to be well-formed. Quoted attributes might be optional, depending on the value, and some attributes don’t even need values.

    * You want better quality. This goes along with the first item. Also, “application/xhtml+xml” can signify quality.

    * Having better document structure and quality can result in easier maintenance and improved re-usability.

    * It’s easier to process and manipulate with XML tools, should you have the need to do so.

    If you have no desire or need for any of the above, or you find the effort not worth the benefit in your case, then there’s no need to trouble yourself with polyglot… otherwise you may want to consider it. Almost everyone’s needs and requirements are different.

    • Patrick

      “You want better structure & consistency. HTML can be “messy” because much is allowed.” – HTML is flexible by design. Most websites pick a coding style (upper or lowercase, whether to use attribute quotes, etc.) and stick to it throughout the site. It’s easier to enforce consistency with XML tools, but any decent coder will be consistent with their code in any language anyway.

      “Also, “application/xhtml+xml” can signify quality.” – Sorry, but I think that’s a ludicrous statement. It signifies XHTML, and that’s it. It’s possible to write horrible, non-semantic, overly-verbose XHTML, and it’s possible to write well-formed, elegant and efficient HTML. I really hope you don’t honestly believe that the MIME type someone uses is a reflection on their coding ability. It’s also worth noting that the overwhelming majority of XHTML websites use text/html, mostly because of compatibility problems with Internet Explorer.

      “Having better document structure and quality can result in easier maintenance and improved re-usability.” – This is true, but you can have a well structured HTML document. It’s up to each coder to decide on a coding style and then stick to it. As long as code style is consistent it should be easy to maintain.

      XML tools can be useful, but for the vast majority of websites they would be unnecessary.

      I used to use XHTML for everything. I’ve since switched to HTML, and my code is just as consistent, readable and maintainable as it was before. HTML 5 is the future, and except in very limited circumstances, I just can’t see the practical benefit of polyglot, especially given how inconvenient it is to use.

      • Albert Wiersch

        Hi Patrick,

        A “decent coder” would certainly be consistent in their style, but they’re still “only human”. Polyglot documents enforce better consistency. That is a good thing. And what if there are multiple developers? They might all have different “standards” making consistency virtually impossible without some type of enforcement.

        I agree one can write junk XHTML, but they don’t have to. Someone who is serious about quality would make sure their XHTML is written correctly, and if I see an XHTML page AND it’s well-formed and follows all the rules, then that says “quality” to me, more so than just an HTML 5 page.

        While polyglot documents may be more inconvenient to write, there’s a payoff in improved quality, consistency, and re-usability for that inconvenience. The question of whether it’s worth it is up to each developer. Some will find it worth the extra effort, others won’t.

      • Patrick

        Well, if there are multiple coders, I’d hope there’d be an established set of project rules to define coding style. And yes, it’s true that XML tools can enforce standards, but HTML can be validated too.

        HTML is not all that complicated – if you have simple rules like “use lowercase” and “use quotes around attributes”, it isn’t at all hard to stick to them. And if you do slip up and make a mistake somewhere, it won’t break the page, it’ll be easy to fix, and it probably won’t make much difference to the overall maintainability.

        I don’t get why you’re on this weird XHTML elitism kick. It’s kinda out-dated. I used to be a huge proponent of XHTML too, until I learned that IE6, 7, and 8 do not support the XHTML MIME type. Almost all XHTML pages are sent as text/html, which actually causes browsers to interpret them as malformed HTML. For example, SitePoint uses XHTML sent as text/html, just like almost every other XHTML site. How does using the wrong MIME type indicate quality?

        Even if that wasn’t the case, I don’t understand why you would equate XHTML with better quality than equivalent HTML. Why is a well-formed XHTML page better than a well-formed HTML page? Why do you consider HTML 5 to be of lower quality, especially given the MIME type issue I mentioned? HTML 5 is the future, XHTML is stagnant.

        My problem with polyglot isn’t even that it’s harder to write. It’s that we have to sacrifice useful things, like noscript tags (essential for many modern websites) to be able to do it. So couldn’t we have a tool that enforces the format of HTML documents without these sacrifices? Why does it have to be XML based? A tweaked version of the W3C HTML validator would work fine.

        XML and HTML are similar, but not the same. They have different syntaxes and, more importantly, different purposes. By trying to conflate these ideas, it’s like you’re trying to push a square block through a circular hole – it isn’t going to work without breaking something.

      • Albert Wiersch

        Patrick, polyglot documents allow you to serve them using “text/html”, and that’s a correct MIME type, but so is the XHTML MIME type. It’s the best of both worlds: either MIME type works, and both are correct when the document is polyglot.

  • Tiny correction: There’s a [ missing in “use //<![CDATA as the first line”; it should read: “use //<![CDATA[ as the first line”.

  • Sam Ruby obliquely references this article and some of the discussion points raised in his blog post In defence of Polyglot.
