A little over a year ago, a research team at Carnegie Mellon University launched reCAPTCHA, a plug-in CAPTCHA service for web sites that serves the dual purpose of fighting spam bots and helping the Internet Archive and other clients make sense of digitized print content.
CAPTCHAs, those hard-to-read images web sites sometimes ask you to transcribe before submitting form data, can be an effective way to combat spam, but they’re also tremendous time sinks. Each day, people on the web are confronted with a whopping 200 million CAPTCHA images, and deciphering them consumes 500,000 hours. The reCAPTCHA system makes brilliant use of that time by putting people to work reading scanned text that optical character recognition (OCR) software had difficulty deciphering.
The service, which is now employed by 40,000 web sites, uses a simple technique to get people to help figure out unknown scanned words. Each reCAPTCHA box presents users with two words: one that the system already knows (a control word) and one that is unknown. If the user gets the control word right, the system can assume that the user's answer for the other word is also likely to be right. Once enough users have entered the same thing for the unknown word, that word can itself be used as a control word.
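The two-word check described above can be sketched in a few lines of Python. This is only an illustration of the idea, not reCAPTCHA's actual code: the names `UnknownWord` and `submit`, the case-insensitive comparison, and the threshold of three matching answers are all assumptions made for the example, since the article does not say how many agreeing users count as "enough."

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class UnknownWord:
    """Hypothetical record of the answers users have typed for one unread word."""
    votes: Counter = field(default_factory=Counter)


def normalize(text):
    # Assumption: answers are compared ignoring case and surrounding spaces.
    return text.strip().lower()


def submit(control_answer, control_guess, unknown_guess, unknown, min_votes=3):
    """Check one submission; return (passed, promoted_word).

    passed        -- True if the user typed the control word correctly.
    promoted_word -- the agreed reading once min_votes users match, else None.
    min_votes is an illustrative threshold, not a figure from the article.
    """
    if normalize(control_guess) != normalize(control_answer):
        # Control word wrong: reject, and ignore the guess for the unknown word.
        return False, None

    # Control word right: trust the other answer enough to count it as a vote.
    unknown.votes[normalize(unknown_guess)] += 1
    word, count = unknown.votes.most_common(1)[0]
    if count >= min_votes:
        # Enough users agree; the word could now serve as a control word.
        return True, word
    return True, None
```

In this sketch, a guess for the unknown word is counted only when the control word matches, so a bot or a careless user who misses the control word never influences the transcription.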
Of those 200 million daily CAPTCHAs, reCAPTCHA serves about 4 million, which is “the equivalent of 1500 people working full-time and transcribing 60 words per minute,” according to a report in this month’s Science. The service, which is free for web sites to use, has deciphered 440 million words for clients over the past year.
According to Ars Technica, reCAPTCHA is also very accurate. In a test that used a random sample of 250 New York Times articles from different time periods, OCR software managed just 84% accuracy on its own. When combined with reCAPTCHA, though, the accuracy rating shot up to 99.1%. That, says Ars, is comparable to professional transcription services, which employ two transcription experts whose work is verified by a third party.
It’s easy to see how reCAPTCHA’s use of the crowd is far more cost-effective than paying professionals. Further, Ars reports that software designed to crack CAPTCHA images fails on reCAPTCHA, likely because the letter distortions on scanned images are not the result of “clean mathematical transformation,” and thus are hard for a computer to correct.
reCAPTCHA is a simply brilliant use of essentially wasted time, and I’m pleased to hear that it’s working. When I first wrote about the program last year for ReadWriteWeb, I noted that in college one of my classes was part of a project to digitize old maritime journals. We used expensive overhead scanners and fancy OCR software, but even so, most of our time was spent correcting mistakes that the software had made. The reCAPTCHA system would have been a welcome addition to our work back then.