reCaptcha uses one problem to crack another 27 Nov 2008
Posted by Tony Law in Tech Watch, Technorati. Tags: human, reCaptcha, security
You’ve heard of Amazon’s Mechanical Turk function call, which provides a software interface to human agents? A call to the Turk offers a task – anything a human might do and a computer can’t, like deciding “Does this image include a face?” – to the community, and acts as agent for any appropriate payment as well.
And you’re familiar too with the grid approach to large scale problems, farming out small components of a task to large numbers of distributed PCs and using their idle cycles to contribute to the task.
One task for the Turk might be interpreting the parts of scanned text which Optical Character Recognition (OCR) software fails to digitise. Some years ago I was involved with scanning in the American Petroleum Institute thesaurus: a very large volume. We had a success rate of about 95%, excellent for those days. But finding and correcting the gaps and mistakes was still a time-consuming task.
It appears the New York Times (NYT) is undertaking the digitisation of its archives: with older print material, OCR accuracy can be as low as 80%.
The Guardian, in its Technology section, reports on a truly innovative approach to this problem which made me shout for joy – because in tackling it, a quite different problem is harnessed and both are solved. Each problem becomes part of the solution to the other. How elegant is that?
Luis von Ahn of Carnegie Mellon University invented the Captcha technique that many online services use to defeat spambots. It’s good, but not entirely effective (watch his online video to see why not). Now, von Ahn has harnessed the NYT’s problem to become part of the Captcha solution, and in doing so is also solving the NYT’s problem. In effect, every user who signs up using the new reCaptcha becomes a Mechanical Turk for the NYT (or for the Internet Archive, whose scanned text is also being tackled).
Von Ahn says he works on "Human Computation, which harnesses the combined computational power of humans and computers to solve large-scale problems". reCaptcha is a classic of this approach. It will display two "puzzles" from the NYT’s digitisation project. One will be a word whose correct digitisation is already known. The other will be one the software has failed to analyse. Fairly obviously, it’s the first one whose correct interpretation is the key to being permitted to sign up. But when several people have provided the same interpretation of the second, so far unsolved, word then two things happen.
First, it is added to the library of "known" puzzles. This means that the puzzles presented to the users are real puzzles, likely to be far harder for spambots to solve. After all, it’s already been demonstrated that quality OCR software finds them hard to analyse.
But second, the solution is returned to the NYT project as a presumed correct interpretation of that segment of the scanned text – so the NYT project progresses.
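For the programmers among you, the flow just described can be sketched in a few lines of Python. This is only my illustration of the idea as reported, not reCaptcha's actual code: the class and method names are hypothetical, and the threshold of three matching answers is an assumption (the reports say only that "several people" must agree).

```python
from collections import Counter, defaultdict

# Assumption: the source says only "several people"; 3 is an illustrative value.
AGREEMENT_THRESHOLD = 3


def normalise(answer):
    """Compare answers case-insensitively and ignore surrounding spaces."""
    return answer.strip().lower()


class ReCaptchaSketch:
    """Hypothetical model of the two-word scheme, not the real service."""

    def __init__(self, known, unknown_ids, threshold=AGREEMENT_THRESHOLD):
        self.known = dict(known)            # puzzle id -> verified text
        self.unknown = set(unknown_ids)     # puzzle ids the OCR could not read
        self.votes = defaultdict(Counter)   # puzzle id -> tally of human answers
        self.threshold = threshold
        self.resolved = {}                  # answers to return to the scanning project

    def submit(self, control_id, control_answer, unknown_id, unknown_answer):
        """Return True if the user passes; record their reading of the unknown word."""
        # Only the already-known word decides whether the user gets in.
        if normalise(control_answer) != normalise(self.known[control_id]):
            return False

        if unknown_id in self.unknown:
            guess = normalise(unknown_answer)
            self.votes[unknown_id][guess] += 1
            if self.votes[unknown_id][guess] >= self.threshold:
                # First effect: the word joins the library of known puzzles.
                self.known[unknown_id] = guess
                self.unknown.discard(unknown_id)
                # Second effect: the reading goes back to the digitisation project.
                self.resolved[unknown_id] = guess
        return True
```

Note that a user who gets the known word wrong is rejected before their reading of the unknown word is counted, so a spambot guessing at random contributes nothing to the tally.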
For me, it’s this synergy of two apparently unrelated puzzle tasks which is the beauty and the elegance of the solution. Who says IT people can’t be creative?!
Links:
• Antispam weapon recaptures lost text: The Guardian, 27 Nov 2008
• reCaptcha: look at the Learn More page for uses of reCaptcha (e.g. on your blog)
• reCaptcha’s sister application Mailhide (uses reCaptcha to secure your email address on the web)
• Amazon’s Mechanical Turk
• Luis von Ahn at CMU: click the Video link for a video of a presentation he made at Google (52 minutes)
Windows Azure – missing from the blogosphere? 19 Nov 2008
Posted by Tony Law in Managing IT, Tech Watch, Technorati. Tags: Azure, cloud computing, distributed computing
It’s a month and more since Ray Ozzie announced it. But there seems to be remarkably little independent comment in the blogosphere about Windows Azure, Microsoft’s push into the world of cloud computing. The exceptions are TechCrunch (links below) and a link by lifehacker to a C|net News report. All credit to them: C|net haven’t just reproduced the Microsoft press release, but given us an account of Ray Ozzie’s announcement presentation at the Professional Developers’ Conference.
I’d better be careful what I mean by “independent” comment. The vast majority of the top hits in a Google blog search go back to blogs specifically concerned with Microsoft Windows or Vista. That’s not to suggest that Microsoft drives their content, or to decry the quality of their comments either. But it means that the wider world has perhaps missed the point. Remember when Microsoft finally “got it” about the Internet? It was playing catch-up, but it did so within the space of about six months.
Cloud computing already has its giants. Amazon led the high-profile way (with AWS services such as S3 and EC2) but there are others, such as 3Tera, which I visited in Silicon Valley a couple of years ago. Major hardware vendors are making their announcements too (TechCrunch again, tracking the HP/Intel/Yahoo! announcement in July). But remember that Ozzie, with Lotus Notes to his name, is a veteran of the distributed services concept. And if Azure lives up to the expectation of being Windows on distributed steroids, then it is likely to be far more important than its coverage to date would suggest.
Watch this space!
Links:
• Microsoft Azure website
• 3Tera
• Ozzie, Muglia, and Srivastava on Windows Azure TechCrunch, 27 Oct 2008 (there’s a video embedded in this report)
• Microsoft Unveils Windows Azure at Professional Developers Conference Microsoft PressPass, 27 Oct 2008
• Ray Ozzie on Azure, Office unchained, and Openness TechCrunch, 29 Oct 2008
• Microsoft launches Windows Azure C|net news, 27 Oct 2008
• Windows Azure unmasked ITPro, 30 Oct 2008
InformationSpan Report series launched 12 Nov 2008
Posted by Tony Law in Insight services, Managing IT, Technorati. Tags: AMR Research, BI, BI Survey, Business Intelligence, Butler Group, Forrester, Gartner, IDC, IT Toolbox, OLAP Report, reports, Ventana Research
InformationSpan has begun to create a series of survey reports which will look at how Insight Services cover specific IT topic areas.
The first Report surveys insight services for Business Intelligence. Recent significant consolidation in the BI marketplace makes authoritative advice essential in this area: reports prior to mid-2007 are likely to be very dated. We review providers with known coverage in this area, from the InformationSpan database of over 400 providers, using our industry structure model for classification: global generalists; global specialists; local generalists; and niche providers.
You can view this first report for free: go to the website and click on the new tab labelled “InformationSpan Reports”. Even if BI isn’t your primary area, you might like to see the approach. Comments will be welcome here, particularly if you think I’ve missed something!
I’m planning one report a month from now on; currently planned coverage includes the Emerging Technology agenda; Risk Management; and Merger & Acquisition Support. If you would like to influence this agenda, or commission a special report, do get in touch!
Once again, no other Links in this posting.