Nowadays, many web developers have moved toward HTML5 over XHTML. But did you know web documents can be both HTML and XML-based at the same time? This is what’s known as polyglot markup: HTML documents that can correctly be served either as text/html or as an XML MIME type such as application/xml or application/xhtml+xml. Like a polyglot person, who speaks more than one language, a polyglot document “speaks” both HTML and XML.
However, note that the aim is to conform to both HTML5 and XML well-formedness; polyglot documents do not have to be valid XHTML.
The Constraints of Going Polyglot
Creating polyglot documents enforces more constraints and structure because they must conform to XML rules for well-formedness. For example, HTML element names and attribute names must typically be in lowercase, and all elements must have an end tag or use the minimized tag syntax (like <br/>). The trick is to make sure that the document parses into identical document trees (though there are some exceptions), whether it is processed by an HTML parser or by an XML parser.
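To make the well-formedness requirement concrete, here is a minimal sketch in Python (my choice of language for illustration; the helper name is_well_formed_xml is hypothetical). It only shows how a strict XML parser reacts to HTML-style shortcuts; it is not a polyglot validator.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(fragment):
    """Return True if the fragment parses as well-formed XML."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


# Minimized tag syntax and matching lowercase names: fine for an XML parser.
polyglot_ok = is_well_formed_xml("<p>line one<br/>line two</p>")

# An unclosed <br> is tolerated by HTML parsers but is not well-formed XML.
unclosed_br = is_well_formed_xml("<p>line one<br>line two</p>")

# XML is case-sensitive, so <P> and </p> do not match each other.
mixed_case = is_well_formed_xml("<P>text</p>")

print(polyglot_ok, unclosed_br, mixed_case)
```

An HTML parser would accept all three fragments, which is exactly why the polyglot rules lean on the stricter XML side.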
By doing this, your documents will almost assuredly be better structured and of higher quality, yet still be able to be treated as HTML5. Another benefit is that they can be processed by XML tools. Also, if you need HTML and XHTML versions of a page, then you won’t need to maintain two different copies of content (which is almost always a bad idea). With a polyglot document, you can serve it as HTML when you need to or as XHTML when you need to, without changing any content in the document.
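As a rough illustration of the “processed by XML tools” benefit, the sketch below feeds a small polyglot page to Python’s standard XML library and pulls out its links. The page content, the extract_links helper, and the h prefix are all made up for this example.

```python
import xml.etree.ElementTree as ET

# Prefix-to-namespace map used only for the lookup below (the prefix is arbitrary).
XHTML_NS = {"h": "http://www.w3.org/1999/xhtml"}

# A made-up polyglot page: the same bytes could be served as text/html.
PAGE = """<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head><title>Links</title></head>
<body>
<p><a href="/about">About</a> <a href="/contact">Contact</a></p>
</body>
</html>"""


def extract_links(markup):
    """Parse the page as XML and return every href found on an a element."""
    root = ET.fromstring(markup)
    return [a.get("href") for a in root.iterfind(".//h:a", XHTML_NS)]


links = extract_links(PAGE)
print(links)
```

No HTML-specific parser is involved here; any tool that understands XML namespaces could do the same job.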
The W3C Recommendations
The W3C has a working draft called Polyglot Markup: HTML-Compatible XHTML Documents, which details design guidelines for polyglot documents. I’ve summarized some of these guidelines below. For more detail, you can review the W3C document, which is actually not that long and not that difficult to read. Here’s the summary:
- Do not use document.write() or document.writeln() because these may not be used in XML. Use the innerHTML property instead.
- Do not use the noscript element because it cannot be used in XML documents.
- Do not use XML processing instructions or an XML declaration.
- Use UTF-8 encoding, and declare it in one of the ways listed in the W3C document. I recommend using
- Use an acceptable DOCTYPE, like <!DOCTYPE html>. Do not use DOCTYPE declarations for HTML4 or previous versions of HTML.
- To maintain XML compatibility, explicitly declare the default namespaces for “html”, “math”, and “svg” elements, like <html xmlns="http://www.w3.org/1999/xhtml">.
- If using any attributes in the XLink namespace, then declare the namespace on the html element or once on the foreign element where it is used.
- Every polyglot document must have at least these elements (they cannot be left out): html, head, title, and body.
- Every tr element must be explicitly wrapped in a tbody, thead, or tfoot element to keep the HTML and XML DOMs consistent.
- Every col element in a table element must be explicitly wrapped in a colgroup element. Again, this is to keep the HTML and XML DOMs consistent.
- Use the correct case for element names. Only lowercase letters may be used for HTML and MathML element names, though some SVG elements must use only lowercase and some must use mixed case.
- Use the correct case for attribute names. Only lowercase letters may be used for HTML and MathML attribute names, with the exception of definitionURL. Some SVG attribute names must use only lowercase and some must use mixed case.
- Maintain case consistency on attribute values. An easy way to do this is to only use lowercase, but this is not required.
- Only certain elements can be void. These elements must use the minimized tag syntax like <br/> (no end tags allowed). Some of these void elements are br, hr, img, input, link, and meta.
- If the HTTP Content-Language header specifies exactly one language tag, specify the language using both the lang and xml:lang attributes on the html element.
- Do not begin the text inside of a pre element with a newline.
- All attribute values must be surrounded by either single or double quotation marks.
- Do not use newline characters within an attribute value.
- Do not use the xml:space and xml:base attributes, except in foreign content like MathML and SVG. These attributes are not valid in documents served as text/html.
- When specifying a language, use both the lang and xml:lang attributes. Do not use one attribute without the other, and both must have identical values.
- Use only the following named entity references: amp, lt, gt, apos, quot. For others, use the decimal or hexadecimal values instead of named entities.
- Always use character references for the less-than sign and the ampersand, except when used in a CDATA section.
- Whenever possible (though not required), script and style elements should link to external files rather than including their content inline (this is good advice even for non-polyglot documents). However, when inline content is used, it should be “safe content” that does not contain any problematic less-than or ampersand characters (escaping them is not an option because the HTML and XML parsers would then build different DOMs). I also recommend wrapping inline script content in a CDATA section, with the CDATA markers commented out (use //<![CDATA[ as the first line before the script and //]]> as the last line, using “//” to comment out the CDATA markers). But again, you can avoid these issues by using external files rather than inline content.
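Several of the guidelines above can be seen together in one small page. The sketch below, written in Python for illustration, embeds a hypothetical polyglot document and reads it twice: once with a strict XML parser and once with the standard library’s lenient HTML tokenizer. Note that html.parser is not a full HTML5 tree builder, so comparing the order of start tags is only a rough proxy for “identical document trees”. The page content and the StartTagCollector class are assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical page: html5 DOCTYPE, declared namespace, lang plus xml:lang,
# explicit html/head/title/body, minimized void elements, wrapped tr and col.
PAGE = """<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
<meta charset="UTF-8"/>
<title>Polyglot example</title>
</head>
<body>
<p>Fish &amp; chips<br/>cost &lt; 10 &#169;</p>
<table>
<colgroup><col/></colgroup>
<tbody><tr><td>cell</td></tr></tbody>
</table>
</body>
</html>"""


class StartTagCollector(HTMLParser):
    """Record element names in the order their start tags appear."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Also reached for <br/>-style tags via the default handle_startendtag.
        self.tags.append(tag)


# XML view: document-order traversal, with the namespace stripped from each name.
root = ET.fromstring(PAGE)
xml_tags = [element.tag.split("}")[-1] for element in root.iter()]

# HTML view: the same text fed to the HTML tokenizer.
collector = StartTagCollector()
collector.feed(PAGE)
collector.close()
html_tags = collector.tags

print(xml_tags == html_tags)
```

If a tr were left outside a tbody, a real HTML5 parser would insert the tbody on its own while the XML parser would not, which is the kind of DOM difference the wrapping rules are meant to prevent.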
Because of the many guidelines above, it is highly recommended that a polyglot-capable checker be used. Of course, you are already validating your current HTML and/or XHTML documents, right? If not, then I recommend that you start doing so immediately because there are many advantages to having higher-quality HTML. For a polyglot checker, you can try the new polyglot checking in the upcoming v12 release of CSE HTML Validator (for Windows). Download and install the free public beta and set the “Validate HTML documents as” option to Polyglot in the DOCTYPE Control options page of the Validator Engine Options. You can use the fully functional public beta for free until January 31, 2013.
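A dedicated checker covers far more than any quick script, but a single rule can be spot-checked by hand. The following Python sketch flags named entity references other than the five that XML predefines. The unsafe_named_entities helper is hypothetical, and the regular expression is deliberately naive: it does not skip CDATA sections, comments, or script content.

```python
import re

# The only named entity references the guidelines above allow.
XML_SAFE_ENTITIES = {"amp", "lt", "gt", "apos", "quot"}

# Matches named references such as &copy; but not numeric ones such as &#169;.
NAMED_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def unsafe_named_entities(markup):
    """Return the sorted, de-duplicated named entities that XML does not predefine."""
    found = set(NAMED_ENTITY.findall(markup))
    return sorted(found - XML_SAFE_ENTITIES)


report = unsafe_named_entities("<p>&copy; 2012 &amp; &nbsp;&#169; &copy;</p>")
print(report)
```

Each name it reports would need to be replaced with its decimal or hexadecimal character reference.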
If your documents are already in XHTML, then converting them to polyglot should be easy. Converting HTML documents may be more work, but you may find that the extra structure and quality requirements are a breath of fresh air compared to the sloppiness that is allowed by HTML5. Better structured documents are also easier to read and maintain, and less likely to be misinterpreted by browsers and search engines that are crawling your site.