reCaptcha uses one problem to crack another

You’ve heard of Amazon’s Mechanical Turk function call, which provides a software interface to human agents? A call to the Turk offers a task – anything a human might do and a computer can’t, like deciding “Does this image include a face?” – to the community, and acts as agent for any appropriate payment as well.

And you’re familiar too with the grid approach to large scale problems, farming out small components of a task to large numbers of distributed PCs and using their idle cycles to contribute to the task.

One task for the Turk might be interpreting the parts of scanned text which Optical Character Recognition (OCR) software fails to successfully digitise. Some years ago I was involved with scanning in the American Petroleum Institute thesaurus: a very large volume. We had a success rate of about 95%, excellent for those days. But finding and correcting the gaps and mistakes was still a time consuming task.

It appears the New York Times (NYT) is undertaking the digitisation of its archives: with older print material, OCR accuracy can be as low as 80%.
The Guardian, in its Technology section, reports on a truly innovative approach to this problem which made me shout for joy – because in tackling it, a quite different problem is harnessed and both are solved. Each problem becomes part of the solution to the other. How elegant is that?

Luis von Ahn of Carnegie Mellon University invented the Captcha technique that many online services use to defeat spambots. It’s good, but not entirely effective (watch his online video to see why not). Now, von Ahn has harnessed the NYT’s problem to become part of the Captcha solution, and in doing so is also solving the NYT’s problem. In effect, every user who signs up using the new reCaptcha becomes a Mechanical Turk for the NYT (or for the Internet Archive, whose scans are also being tackled).

Von Ahn says he works on “Human Computation, which harnesses the combined computational power of humans and computers to solve large-scale problems”. reCaptcha is a classic of this approach. It will display two “puzzles” from the NYT’s digitisation project. One will be a word whose correct digitisation is already known. The other will be one the software has failed to analyse. Fairly obviously, it’s the first one whose correct interpretation is the key to being permitted to sign up. But when several people have provided the same interpretation of the second, so far unsolved, word, two things happen.

First, it is added to the library of “known” puzzles. This means that the puzzles presented to the users are real puzzles, likely to be far less easy for spambots to solve. After all, it’s already been demonstrated that quality OCR software finds them hard to analyse.

But second, the solution is returned to the NYT project as a presumed correct interpretation of that segment of the scanned text – so the NYT project progresses.
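For the programmers among you, the two-word logic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `WordPool` class, its `submit` method, the image identifiers and the agreement threshold of three are all invented here, and none of them is taken from the real reCaptcha service, whose internals the article does not describe.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict

# Assumption: how many matching answers count as "several people agree".
AGREEMENT_THRESHOLD = 3


def normalise(text: str) -> str:
    """Compare answers without regard to case or surrounding spaces."""
    return text.strip().lower()


@dataclass
class WordPool:
    """Hypothetical store of scanned word images and what is known about them."""
    known: Dict[str, str] = field(default_factory=dict)      # image id -> verified text
    votes: Dict[str, Counter] = field(default_factory=dict)  # image id -> answer tallies
    resolved: Dict[str, str] = field(default_factory=dict)   # answers to hand back to the scanning project

    def submit(self, control_id: str, control_answer: str,
               unknown_id: str, unknown_answer: str) -> bool:
        """Return True if the user passes; record their reading of the unknown word."""
        # Only the already-known word decides whether the user may sign up.
        if normalise(control_answer) != normalise(self.known[control_id]):
            return False
        # The unknown word may already have been settled by earlier users.
        if unknown_id in self.known:
            return True
        guess = normalise(unknown_answer)
        tally = self.votes.setdefault(unknown_id, Counter())
        tally[guess] += 1
        if tally[guess] >= AGREEMENT_THRESHOLD:
            # First: the word joins the library of known puzzles.
            self.known[unknown_id] = guess
            # Second: the reading goes back to the digitisation project.
            self.resolved[unknown_id] = guess
            del self.votes[unknown_id]
        return True
```

A user who gets the known word wrong is rejected and their reading of the unknown word is discarded, so a spambot guessing at random cannot pollute the tallies. Once three users who passed the check agree on the unknown word, it appears in both `known` and `resolved`.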

For me, it’s this synergy of two apparently unrelated puzzle tasks which is the beauty and the elegance of the solution. Who says IT people can’t be creative?!

• Antispam weapon recaptures lost text: The Guardian, 27 Nov 2008
• reCaptcha: look at the Learn More page for uses of reCaptcha (e.g. on your blog)
• reCaptcha’s sister application Mailhide (uses reCaptcha to secure your email address on the web)
• Amazon’s Mechanical Turk
• Luis von Ahn at CMU: click the Video link for a video of a presentation he made at Google (52 minutes)