reCAPTCHA: Awesome Use of Wasted Time That Works

A little over a year ago, a research team at Carnegie Mellon University launched reCAPTCHA, a plug-in CAPTCHA service for web sites that serves the dual purpose of fighting spam bots and helping the Internet Archive and other clients make sense of digitized print content.

CAPTCHAs, those hard-to-read images web sites sometimes ask you to decipher before submitting form data, can be an effective way to combat spam, but they’re also tremendous time sinks. Each day on the web people are confronted with a whopping 200 million CAPTCHA images, and deciphering them consumes 500,000 hours. The reCAPTCHA system makes brilliant use of that time by putting people to work reading scanned text that optical character recognition (OCR) software had difficulty understanding.

The service, which is now employed by 40,000 web sites, uses a simple technique to get people to help figure out unknown scanned words. Each reCAPTCHA box presents users with two words: one that the system already knows (a control word) and one that it does not. If the user gets the control word correct, the system can assume that the user's reading of the other word also has a high likelihood of being correct. If enough users enter the same thing for the unknown word, that reading is accepted and the word can itself be used as a control word.
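For the curious, the two-word check is simple enough to sketch in a few lines of Python. This is only a toy model of the idea as described here, not reCAPTCHA's actual code: the class name, the image ids, and especially the agreement threshold of 3 are my own assumptions (the reports only say "enough users").

```python
from collections import Counter, defaultdict

# Assumption: the source only says "enough users"; 3 is an arbitrary placeholder.
AGREEMENT_THRESHOLD = 3


class TwoWordChallenge:
    """Toy model of the control-word / unknown-word scheme (hypothetical names)."""

    def __init__(self, known_words, threshold=AGREEMENT_THRESHOLD):
        # image id -> accepted transcription; these can serve as control words
        self.known = dict(known_words)
        # image id -> tally of guesses from users who passed the control word
        self.votes = defaultdict(Counter)
        self.threshold = threshold

    @staticmethod
    def _normalize(text):
        # Compare answers case-insensitively and ignore stray whitespace.
        return text.strip().lower()

    def submit(self, control_id, control_answer, unknown_id, unknown_answer):
        """Record one user's answers; return True if the control word matched."""
        expected = self._normalize(self.known[control_id])
        if self._normalize(control_answer) != expected:
            return False  # failed the control word, so the other guess is ignored
        if unknown_id in self.known:
            return True  # this word has already been resolved
        guess = self._normalize(unknown_answer)
        self.votes[unknown_id][guess] += 1
        if self.votes[unknown_id][guess] >= self.threshold:
            # Enough matching answers: promote the word to the known set.
            self.known[unknown_id] = guess
            del self.votes[unknown_id]
        return True


if __name__ == "__main__":
    service = TwoWordChallenge({"img-1": "morning"})
    for _ in range(3):
        service.submit("img-1", "morning", "img-2", "upon")
    print(service.known.get("img-2"))  # prints "upon"
```

Note that a guess only counts when the same user got the control word right, which is what lets the system trust answers for a word it cannot check directly.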

Of those 200 million daily CAPTCHAs, reCAPTCHA serves about 4 million, which is “the equivalent of 1500 people working full-time and transcribing 60 words per minute,” according to a report in this month’s Science. The service, which is free for web sites to use, has deciphered 440 million words for clients over the past year.

According to Ars Technica, reCAPTCHA is also very accurate. In a test that used a random sample of 250 New York Times articles from different time periods, OCR software managed just 84% accuracy on its own. When combined with reCAPTCHA, though, the accuracy rating shot up to 99.1%. That, says Ars, is comparable to professional transcription services that employ two transcription experts whose work is verified by a third party.

It’s easy to see how reCAPTCHA’s use of the crowd is far more cost-effective. Further, Ars reports that software designed to crack CAPTCHA images fails on reCAPTCHA, likely because the letter distortions on scanned images are not the result of “clean mathematical transformation,” and thus are hard for a computer to correct.

reCAPTCHA is a simply brilliant use of essentially wasted time, and I’m pleased to hear that it’s working. When I first wrote about the program last year for ReadWriteWeb, I noted that in college one of my classes was part of a project to digitize old maritime journals. We used expensive overhead scanners and fancy OCR software, but even so, most of our time was spent correcting mistakes that the software had made. The reCAPTCHA system would have been a welcome addition to our work back then.

  • http://www.flywalk.co.uk themightystephen

    Cool invention.

  • http://www.hostedwebtools.com Darren884

    I heard about this on the radio… they say within 1 year they will have completed a library full of books… this is one of the most innovative inventions ever.

  • ganowns

    Very innovative stuff..

  • http://www.howrank.com mkoenig

    I heard this on NPR Friday or Thursday. Great idea. :)

  • khreativ

    I learned about it last month as my current employers are implementing it in their sites to combat spammers. I’ve also decided to use it in my personal site as well. Recaptcha is a great idea!!

  • http://www.xeninesolutions.com bvarvel

    That is freakin’ brilliant.

  • used1

    It seems brilliant, and you can make arguments as to how nice it is, but after having to enter values in this tool over 100 times, I can tell you it is truly, remarkably awful.

  • http://sunrisersalumni.org bemmott

    Awesome, my ass. Try those damned things if you have ANY eyesight problems and then tell me how great they are. I had to go through 8 sets of words this morning on one site before I found something I could make out.

  • ben332211

    bemmott may not have noticed the Audio button on the right that plays a sound with 8 numbers spoken instead. Obviously not great if you have trouble with hearing as well as eyesight, but it is another option. :)

  • SteveJ

    I hate it. It doesn’t “use wasted time”. It doubles the amount of time it takes to complete the CAPTCHA, and uses the second half for something constructive. The first half is still wasted: the second half merely stolen.

    My next CAPTCHA system design is that if the user can come round my house and wash the dishes, then they’re human.
