Open Your Data Up to Bots Using Microdata

By Alexis Goldstein, Estelle Weyl, Louis Lazaris

The following is an extract from our book, HTML5 & CSS3 for the Real World, 2nd Edition, written by Alexis Goldstein, Louis Lazaris, and Estelle Weyl. Copies are sold in stores worldwide, or you can buy it in ebook form here.

Microdata is another technology that’s rapidly gaining adoption and support, but, unlike WAI-ARIA, it’s technically part of HTML5. Although the Microdata specification is still early in development, it’s worth mentioning here, because the technology provides a peek into what may be the future of document readability and semantics.

In the spec, Microdata is defined as a mechanism that “allows machine-readable data to be embedded in HTML documents in an easy-to-write manner, with an unambiguous parsing model.”

With Microdata, page authors can add specific labels to HTML elements, annotating them so that they can be read by machines or bots. This is done by means of a customized vocabulary. For example, you might want a script or other third-party service to be able to access your pages and interact with specific elements on the page in a certain manner. With Microdata, you can extend existing semantic elements (such as article and figure) to allow those services to have specialized access to the annotated content.

This can appear confusing, so let’s think about a real-world example. Let’s say your site includes reviews of movies. You might have each review in an article element, with a number of stars or a percentage score for your review. But when a machine comes along, such as Google’s search spider, it has no way of knowing which part of your content is the actual review—all it sees is a bunch of text on the page.

Why would a machine want to know what you thought of a movie? Google has started displaying richer information in its search results pages, in order to provide searchers with more than just textual matches for their queries. It does this by reading the review information that sites encode into their pages using Microdata or other similar technologies. An example of movie review information is shown below.

[Image: a rich snippet showing movie review information in search results]

By using Microdata, you can specify exactly which parts of your page correspond to reviews, people, events, and more—all in a consistent vocabulary that software applications can understand and make use of.

Aren’t HTML5’s semantics enough?

The HTML5 spec now includes a number of new elements to allow for more expressive markup. But it would be counterproductive to add elements to HTML that would only be used by a handful of people. This would bloat the language, making its features unmaintainable from all perspectives—whether that’s specification authors, browser vendors, or standards bodies.

Microdata, on the other hand, allows developers to use custom vocabularies (either existing ones or their own) for specific situations—ones that aren’t possible using HTML5’s semantic elements. Thus existing HTML elements and new elements added in HTML5 are kept as a sort of semantic baseline, while specific annotations can be created by developers to target their own needs.

The Microdata Syntax

Microdata works with existing, well-formed HTML content, and is added to a document by means of name-value pairs (also called properties). Microdata doesn’t let you create new elements; instead, it gives you the option to add customized attributes that expand on the semantics of existing elements.

Here’s a simple example:

<aside itemscope> 
  <h1 itemprop="name">John Doe</h1> 
  <p><img src="http://www.sitepoint.com/bio-photo.jpg" alt="John Doe" itemprop="photo"></p>
  <p><a href="http://www.sitepoint.com" itemprop="url">Author’s website</a></p>
</aside>

In the example above, we have a run-of-the-mill author bio placed inside an aside element. The first oddity you’ll notice is the Boolean itemscope attribute. This identifies the aside element as the container that defines the scope of our Microdata vocabulary. The presence of the itemscope attribute defines what the spec refers to as an item. Each item is characterized by a group of name-value pairs.

The ability to define the scope of our vocabularies allows us to define multiple vocabularies on a single page. In this example, all name-value pairs inside the aside element are part of a single Microdata vocabulary.
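To illustrate scoping, here is a minimal sketch of two separate items on one page. The second author and both example.com URLs are hypothetical, added only to show that each itemscope starts its own group of name-value pairs:

```html
<!-- First item: its name and url properties belong to this aside only -->
<aside itemscope>
  <h1 itemprop="name">John Doe</h1>
  <p><a href="http://www.example.com/john" itemprop="url">John’s website</a></p>
</aside>

<!-- Second item: the same property names, but scoped to a different item -->
<aside itemscope>
  <h1 itemprop="name">Jane Roe</h1>
  <p><a href="http://www.example.com/jane" itemprop="url">Jane’s website</a></p>
</aside>
```

Because each aside carries its own itemscope attribute, a parser treats the two name properties as belonging to different items rather than as two values of one property.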

After the itemscope attribute, the next item of interest is the itemprop attribute, which has a value of "name". At this point, it’s probably a good idea to explain how a script would obtain information from these attributes, as well as what we mean by “name-value pairs.”

Understanding Name-Value Pairs

A name is a property defined with the help of the itemprop attribute. In our example, the first property name happens to be one called name. There are two additional property names in this scope: photo and url.

The values for a given property are defined differently, depending on the element the property is declared on. For most elements, the value is taken from its text content; for instance, the name property in our example would obtain its value from the text content between the opening and closing h1 tags. Other elements are treated differently.

The photo property takes its value from the src attribute of the image, so the value consists of a URL pointing to the author’s photo. The url property, although defined on an element that has text content (namely, the phrase “Author’s website”), doesn’t use this text content to determine its value; instead, it obtains its value from the href attribute.
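As a rough sketch of how a script might collect these name-value pairs, the page below reads the author bio using standard DOM methods. The propertyValue helper is a hypothetical name, and it covers only the three cases discussed so far; it ignores nested items and the other special-case elements:

```html
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>Reading Microdata values</title>
</head>
<body>
  <aside itemscope>
    <h1 itemprop="name">John Doe</h1>
    <p><img src="http://www.sitepoint.com/bio-photo.jpg" alt="John Doe" itemprop="photo"></p>
    <p><a href="http://www.sitepoint.com" itemprop="url">Author’s website</a></p>
  </aside>

  <script>
    // Hypothetical helper: picks a property's value based on the element type.
    function propertyValue(el) {
      if (el.matches('img')) {
        return el.src;          // images: the URL in the src attribute
      }
      if (el.matches('a')) {
        return el.href;         // links: the URL in the href attribute
      }
      return el.textContent;    // other elements in this sketch: the text content
    }

    // Find the item, then log each property name alongside its value.
    var item = document.querySelector('[itemscope]');
    item.querySelectorAll('[itemprop]').forEach(function (el) {
      console.log(el.getAttribute('itemprop') + ': ' + propertyValue(el));
    });
  </script>
</body>
</html>
```

Opening this page and checking the browser console should show one line each for the name, photo, and url properties.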

Other elements that don’t use their associated text content to define Microdata values include meta, iframe, object, audio, link, and time. For a comprehensive list of elements that obtain their values from somewhere other than the text content, see the Values section of the Microdata specification.
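The following sketch shows a few of these elements side by side. The event, its property names, and the example.com URLs are all made up for illustration; the comments note where each value would come from:

```html
<!-- Hypothetical event listing with no itemtype; property names are illustrative only -->
<article itemscope>
  <!-- Value comes from the text content: "Launch Party" -->
  <h2 itemprop="name">Launch Party</h2>

  <!-- Value comes from the datetime attribute, not the visible text -->
  <p><time itemprop="startDate" datetime="2015-03-14T19:00">March 14 at 7pm</time></p>

  <!-- Value comes from the content attribute; nothing is displayed -->
  <meta itemprop="admission" content="free">

  <!-- Value comes from the href attribute -->
  <link itemprop="tickets" href="http://www.example.com/tickets">

  <!-- Value comes from the src attribute -->
  <audio itemprop="promo" src="http://www.example.com/promo.mp3" controls></audio>
</article>
```

The time element is a handy case: readers see a friendly date, while machines get the unambiguous value from the datetime attribute.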

Microdata Namespaces

What we’ve described so far is acceptable for Microdata that’s not intended to be reused—but that’s a little impractical. The real power of Microdata is unleashed when, as we discussed, third-party scripts and page authors can access our name-value pairs and find beneficial uses for them.

In order for this to happen, each item must define a type by means of the itemtype attribute. Remember that an item in the context of Microdata is the element that has the itemscope attribute set. Every element and name-value pair inside that element is part of that item. The value of the itemtype attribute, therefore, defines the namespace for that item’s vocabulary. Let’s add an itemtype to our example:

<aside itemscope itemtype="http://schema.org/Person">
  <h1 itemprop="name">John Doe</h1>
  <p><img src="http://www.sitepoint.com/bio-photo.jpg" alt="John Doe" itemprop="photo"></p>
  <p><a href="http://www.sitepoint.com" itemprop="url">Author’s website</a></p>
</aside>

In our item, we’re using the “http://schema.org/Person” URL, which is from Schema.org, a collaborative project supported by several major search engines. This website houses a number of Microdata vocabularies, including Organization, Person, Review, Event, and more.
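Returning to the movie review scenario from earlier, here is one way such a review might be marked up with the Schema.org Review vocabulary. This is a sketch rather than a recipe: the movie title, headline, and review text are invented, and search engines may require more or different properties before displaying rich results. Note that it also nests items, which happens when an element carries both itemprop and itemscope:

```html
<article itemscope itemtype="http://schema.org/Review">
  <h2 itemprop="name">A surprisingly moving sequel</h2>
  <p>
    Review of
    <!-- The thing being reviewed is itself an item of type Movie -->
    <span itemprop="itemReviewed" itemscope itemtype="http://schema.org/Movie">
      <cite itemprop="name">Example Movie II</cite>
    </span>
    by
    <!-- The reviewer is an item of type Person -->
    <span itemprop="author" itemscope itemtype="http://schema.org/Person">
      <span itemprop="name">John Doe</span>
    </span>
  </p>
  <!-- The star rating is an item of type Rating -->
  <p itemprop="reviewRating" itemscope itemtype="http://schema.org/Rating">
    <span itemprop="ratingValue">4</span> out of
    <span itemprop="bestRating">5</span> stars
  </p>
  <p itemprop="reviewBody">The plot is thin, but the performances carry it.</p>
</article>
```

With this markup, a bot no longer sees just a bunch of text: it can tell which part is the rating, which is the reviewer, and which movie the review is about.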

Further Reading

This brief introduction to Microdata barely does the topic justice, but we hope it will provide you with a taste of what’s possible when extending the semantics of your documents with this technology.

It’s a very broad topic that requires reading and research outside of this source. With that in mind, here are a few links to check out if you want to delve deeper into the possibilities offered by Microdata:

  • RDFa 1.1, recently published, no longer requires XHTML and can in fact be used in HTML5, much like Microdata.
