reCaptcha uses one problem to crack another

You’ve heard of Amazon’s Mechanical Turk, which provides a software interface to human agents? A call to the Turk offers a task to the community – anything a human can do and a computer can’t, like deciding “Does this image include a face?” – and acts as agent for any appropriate payment as well.

And you’re familiar too with the grid approach to large scale problems, farming out small components of a task to large numbers of distributed PCs and using their idle cycles to contribute to the task.

One task for the Turk might be interpreting the parts of scanned text which Optical Character Recognition (OCR) software fails to digitise successfully. Some years ago I was involved with scanning in the American Petroleum Institute thesaurus: a very large volume. We had a success rate of about 95%, excellent for those days. But finding and correcting the gaps and mistakes was still a time-consuming task.

It appears the New York Times (NYT) is undertaking the digitisation of its archives: with older print material, the success rate can be as low as 80%.
The Guardian, in its Technology section, reports on a truly innovative approach to this problem which made me shout for joy – because in tackling it, a quite different problem is harnessed and both are solved. Each problem becomes part of the solution to the other. How elegant is that?

Luis von Ahn of Carnegie Mellon University invented the Captcha technique that many online services use to defeat spambots. It’s good, but not entirely effective (watch his online video to see why not). Now, von Ahn has harnessed the NYT’s problem to become part of the Captcha solution, and in doing so is also solving the NYT’s problem. In effect, every user who signs up using the new reCaptcha becomes a Mechanical Turk for the NYT (or the Internet Archive, also being tackled).

von Ahn says he works on “Human Computation, which harnesses the combined computational power of humans and computers to solve large-scale problems”. reCaptcha is a classic example of this approach. It displays two “puzzles” from the NYT’s digitisation project. One is a word whose correct digitisation is already known. The other is one the software has failed to analyse. Fairly obviously, it’s the first one whose correct interpretation is the key to being permitted to sign up. But when several people have provided the same interpretation of the second, so far unsolved, word then two things happen.

First, it is added to the library of “known” puzzles. This means that the puzzles presented to users are real puzzles, likely to be far harder for spambots to solve. After all, it’s already been demonstrated that quality OCR software finds them hard to analyse.

But second, the solution is returned to the NYT project as a presumed correct interpretation of that segment of the scanned text – so the NYT project progresses.
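The two-word voting scheme described above can be sketched in a few lines of Python. This is a hypothetical simplification, not reCaptcha’s actual implementation: the data structures, function names and the threshold of three matching answers are all assumptions for illustration.

```python
# Hypothetical sketch of the reCaptcha voting scheme described above.
# The names and the threshold of 3 matching answers are assumptions.
from collections import Counter

known_words = {"img_001": "petroleum"}    # control words the OCR got right
unknown_votes = {"img_002": Counter()}    # words the OCR failed to analyse
solved = {}                               # interpretations returned to the archive
VOTE_THRESHOLD = 3

def submit(known_id, known_answer, unknown_id, unknown_answer):
    """Return True if the user passes; tally their vote on the unknown word."""
    if known_answer != known_words[known_id]:
        return False                      # wrong on the control word: sign-up refused
    votes = unknown_votes[unknown_id]
    votes[unknown_answer] += 1
    word, count = votes.most_common(1)[0]
    if count >= VOTE_THRESHOLD:
        known_words[unknown_id] = word    # promote to the "known" puzzle library...
        solved[unknown_id] = word         # ...and return it to the digitisation project
    return True
```

Note that only the control word gates the sign-up; the unsolved word contributes a vote regardless of what the user types, and consensus across several users is what promotes an interpretation.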

For me, it’s this synergy of two apparently unrelated puzzle tasks which is the beauty and the elegance of the solution. Who says IT people can’t be creative?!

• Antispam weapon recaptures lost text: The Guardian, 27 Nov 2008
• reCaptcha: look at the Learn More page for uses of reCaptcha (e.g. on your blog)
• reCaptcha’s sister application Mailhide (uses reCaptcha to secure your email address on the web)
• Amazon’s Mechanical Turk
• Luis von Ahn at CMU: click the Video link for a video of a presentation he made at Google (52 minutes)

Windows Azure – missing from the blogosphere?

It’s a month and more since Ray Ozzie announced it. But there seems to be remarkably little independent comment in the blogosphere about Windows Azure, Microsoft’s push into the world of cloud computing. The exceptions are TechCrunch (links below) and a link by Lifehacker to a C|net News report. All credit to them: C|net haven’t just reproduced the Microsoft press release, but given us an account of Ray Ozzie’s announcement presentation at the Professional Developers’ Conference.

I’d better be careful what I mean by “independent” comment. The vast majority of the top hits in a Google blog search go back to blogs specifically concerned with Microsoft Windows or Vista. That’s not to suggest that Microsoft drives their content, nor to decry the quality of their comments. But it means that the wider world has perhaps missed the point. Remember when Microsoft finally “got it” about the Internet? It was playing catch-up, but it did so within the space of about six months.

Cloud computing already has its giants. Amazon led the high-profile way (with AWS, S3, EC2 and so on) but there are other Silicon Valley companies such as 3Tera, whom I visited a couple of years ago. Major hardware vendors are making their announcements too (TechCrunch again, tracking the HP/Intel/Yahoo! announcement in July). But remember that Ozzie, with Lotus Notes to his name, is a veteran of the distributed services concept. And if Azure lives up to the expectation of being Windows on distributed steroids, then it is likely to be far more important than its coverage to date would suggest.

Watch this space!

• Microsoft Azure website
• Ozzie, Muglia, and Srivastava on Windows Azure TechCrunch, 27 Oct 2008 (there’s a video embedded in this report)
• Microsoft Unveils Windows Azure at Professional Developers Conference Microsoft PressPass, 27 Oct 2008
• Ray Ozzie on Azure, Office unchained, and Openness TechCrunch, 29 Oct 2008
• Microsoft launches Windows Azure C|net news, 27 Oct 2008
• Windows Azure unmasked ITPro, 30 Oct 2008

InformationSpan Report series launched

InformationSpan has begun to create a series of survey reports which will look at Insight Services coverage of specific IT topic areas.

The first Report surveys insight services for Business Intelligence. Recent significant consolidation in the BI marketplace makes authoritative advice essential in this area: reports prior to mid-2007 are likely to be very dated. We review providers with known coverage in this area, from the InformationSpan database of over 400 providers, using our industry structure model for classification: global generalists; global specialists; local generalists; and niche providers.

You can view this first report for free: go to the website and click on the new tab labelled “InformationSpan Reports”. Even if BI isn’t your primary area, you might like to see the approach. Comments will be welcome here, particularly if you think I’ve missed something!

I’m planning one report a month from now on; current planned coverage includes the Emerging Technology agenda; Risk Management; and Merger & Acquisition Support. If you would like to influence this agenda, or commission a special report, do get in touch!

Once again, no other Links in this posting.

Providers: how accessible is your meta-information?

I’ve been doing research to update an InformationSpan coverage report on Business Intelligence. I’m struck by the different approaches of providers that help, or hinder, this task.

Remember – I’m not trying to read the content necessarily (though some elements of it are useful). I’m coming at it from the perspective of an enterprise trying to find the best insight services provider for their needs. So I’m trying to find out the depth of their coverage, how important the topic is for them, and how up to date they are. What I really want is meta-information: information about information.

To show what I mean, here are a handful of case studies.

Analyst firm A – a well known global specialist – offers me a guest account. Even without this, I can explore in reasonable depth just using the search box on the home page and the About section of the website. Then, when I sign in, I can see the whole structure of their website as a paying client would see it. I can browse the analyst biographies, undertake searches, and in fact do pretty much anything a client would do except read the premium content research. So I have a pretty good idea how many analysts cover this area, which of them I’ve heard of or encountered, and from the abstracts of the published reports I can see at least some of the companies covered in their writing. I think I can make a pretty good assessment of their coverage, based on what I’ve seen.

Company B has a similar model, and their guest login offers plenty of complimentary access to the full text of quite recent reports – cleverly, via a Flash Player presentation, which means there’s no downloadable version but you can see the whole thing. Even better!

Analyst firm C – a global generalist – has a great deal of good content. In fact, BI is one of their primary coverage areas. But if I didn’t know, I might well not realise it. I can see a fully featured non-client home page. But there’s no search box on it, so it’s rather difficult to assess what they’ve got. It turns out that the best route into the information I need is to browse the Analysts section, because their people are indexed by coverage area; or to go through Events. There are links there to reports, but only the barest abstract is visible as a non-client. I did find a key report via a vendor’s website (though I had to use Google to find it) and in the end, from prior knowledge and reasonably successful online research, again I have a fair idea of their coverage. But it was a lot harder work!

Provider D may or may not have content. But since the only information I can find from their web pages, as a non-member, is about the structure of their service and the highest level information about content, and there’s no search function without a login, I simply don’t know!

So if I’m an enterprise trying to find specific coverage, across the marketplace of four hundred or so general and specialist providers, guess which providers simply aren’t going to figure? They can’t be assessed, and they won’t make the cut.

Providers – please, at least, provide a search box so that we can see what you do!

PS – you can see the report via the InformationSpan website. Click the new tab section for Reports. No other links on this posting!