Calais, the Semantic Web service from Thomson Reuters, is announcing a new commercial version today at the EmTech Conference on the MIT campus in Cambridge, Massachusetts. Calais is a web service and open API that allows web publishers to automatically scan content and pull out semantic metadata. In other words, services built on the Calais API can semantically mark up content automatically.
According to Tom Tague, who leads the Calais initiative at Thomson Reuters, the team finally reached a critical mass of people using Calais, including both large companies and smaller web startups, who were telling them that in order to really utilize Calais, they needed a professional version with a service-level agreement (SLA). So Tague and company responded with the professional version, which, for $2,000 per month and a one-year commitment, comes with 24×7 monitoring and 100,000 transactions per day (20 per second), up from 40,000 per day and 4 per second on the free version.
Tague told me that most users asking for the professional version didn’t need more volume; they just needed a guarantee that the service would be available and that Thomson Reuters was serious about keeping it going.
In addition to the professional edition, Calais is also announcing an enterprise version that can be installed on-site for clients that can’t let their content outside their firewall. Tague told me that the enterprise version will appeal to clients dealing with health records, financial data, or other sensitive information, or to clients who require a very large volume of transactions, where it makes sense to do the processing locally rather than sending it out to a service that exists in the cloud.
One of the biggest knocks against Calais early on was that because of its pedigree as a business application called Clear Forest (which Thomson Reuters acquired), it was biased toward business language. That meant that it was of limited usefulness for sites that didn’t deal with business topics. I asked Tague for an update on their progress in expanding Calais’ vocabulary to understand semantics outside of the business realm, and he told me that Calais had improved by leaps and bounds since it first launched.
The vocabulary has grown by about 40% since the Clear Forest days, according to Tague, and now includes pop culture entities such as musical groups, events, entertainers, and sports teams, as well as healthcare industry entity types. Calais is even working with some clients to create specialized vocabularies, and has about a dozen full-time natural language programmers adding new entities at the rate of 10-12 items per month. Tague says that he can’t recall hearing the “too focused on business” complaint at all in the past three or four months.
Even though the professional version of Calais was the big news, Tague was more excited to talk to me about SemanticProxy, a new service from Open Calais.
SemanticProxy, which is built on Calais 3.0, works like a proxy server for extracting semantic information from web content. It takes a URL, fetches the page, cleans it up and processes it with Calais, and then returns semantic metadata in HTML, RDF, or Microformats.
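To make that fetch, clean, extract flow concrete, here is a minimal Python sketch of the same pipeline shape. It is not the SemanticProxy or Calais API: the function names (`semantic_proxy`, `clean_html`, `extract_entities`), the toy vocabulary, and the plain-dictionary output are all assumptions made for illustration, and the entity "extraction" is a simple string lookup standing in for the real natural language processing. The page fetcher is passed in as a function so the sketch runs without network access.

```python
from html.parser import HTMLParser
from typing import Callable, Dict, List


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def clean_html(html: str) -> str:
    """The 'clean it up' step: reduce a page to its visible text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical toy vocabulary; a real service would use trained extractors.
TOY_VOCABULARY = {"Thomson Reuters": "Company", "MIT": "Organization"}


def extract_entities(text: str) -> List[Dict[str, str]]:
    """Stand-in for semantic extraction: naive substring lookup."""
    return [
        {"name": name, "type": kind}
        for name, kind in TOY_VOCABULARY.items()
        if name in text
    ]


def semantic_proxy(url: str, fetch: Callable[[str], str]) -> Dict[str, object]:
    """Take a URL, fetch the page, clean it, and return extracted metadata."""
    html = fetch(url)
    text = clean_html(html)
    return {"url": url, "entities": extract_entities(text)}


if __name__ == "__main__":
    # A canned page stands in for a real HTTP fetch.
    sample = "<html><body><p>Thomson Reuters spoke at <b>MIT</b>.</p></body></html>"
    print(semantic_proxy("http://example.com/story", lambda _url: sample))
```

The sketch returns a Python dictionary only to stay short; serializing the result as HTML, RDF, or Microformats, as the article describes, would be a further step.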
“In the future, the Web will be one giant yet tightly interconnected information asset that delivers the content and services people need in the fashion and format they desire. Beyond publishing information for people, every site will expose its content in a way that’s readable by machines. Machines will mix, match, filter and aggregate information to greatly improve the experience for everyone,” said Tague. Unfortunately, for a lot of publishers, investing in semantically marking up their content is infeasible, either because they have overwhelmingly large back catalogs of content that needs attention, or because they publish transient content (such as news) that is only read for a short time.
The goal of Calais, and of services built on it like SemanticProxy, is to remove the barriers to marking up and adding semantics to content. “The Semantic Web is going to be a critical mass play,” Tague told me. You need enough publishers to produce semantically marked up content for the vision to work, and the easier you make it for them to add semantics to their content, the more it will happen.
Like Yahoo!’s SearchMonkey, which encourages the use of RDF and other semantic markup, Calais and SemanticProxy will help publishers along the road to the Semantic Web by stimulating activity and making it easier to mark up content.
Josh Catone joined Mashable in May 2009 and is Executive Director of Editorial Projects. Before joining Mashable, Josh was the Lead Writer at ReadWriteWeb, the Lead Blogger at SitePoint, and the Community Evangelist at DandyID.