Technology Review - Published By MIT
TR35

2007 Young Innovator

Luis von Ahn, 29

Carnegie Mellon University

Using “captchas” to digitize books


Luis von Ahn is a pioneer of "captchas"--those strings of distorted characters that websites force you to recognize and type in order to establish that you are a person and not a malevolent computer. But he finds the technology's success a mixed blessing. "At first I was feeling quite proud of myself," says von Ahn, a 2006 MacArthur "genius grant" recipient who created captchas (an acronym for "completely automated public Turing test to tell computers and humans apart") for Yahoo in 2000 to thwart automated e-mail account registration, a tool of spammers. "But then I was feeling bad, because every time you solve a captcha, you waste 10 seconds." People around the world solve an estimated 60 million captchas every day, which at 10 seconds apiece adds up to more than 150,000 wasted hours daily.
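The arithmetic behind that estimate can be checked in a few lines. This is only a back-of-the-envelope sketch using the two figures quoted above (60 million captchas a day, 10 seconds each); the function and constant names are illustrative, not taken from any recaptcha code.

```python
# Figures quoted in the article; the names are illustrative.
CAPTCHAS_PER_DAY = 60_000_000
SECONDS_PER_CAPTCHA = 10
SECONDS_PER_HOUR = 3600


def wasted_hours_per_day(captchas=CAPTCHAS_PER_DAY, seconds_each=SECONDS_PER_CAPTCHA):
    """Return the total human time spent on captchas per day, in hours."""
    return captchas * seconds_each / SECONDS_PER_HOUR


print(round(wasted_hours_per_day()))  # 166667
```

At roughly 167,000 hours, the result is consistent with the "more than 150,000" figure.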


Von Ahn, an assistant professor of computer science, is a leader in using human skills to make computers work better. For example, he created an online game in which players identify elements in photographs; their answers help improve image-search algorithms. He's now trying to put captchas to work in one of the epic efforts of the information age: digitizing millions of old books and making them searchable online.


An estimated 8 percent of words in these old books can't be read by the optical character recognition (OCR) software used to scan them. Von Ahn has teamed with the nonprofit Internet Archive to use captchas to help interpret those words. After all, he says, "while you are solving a captcha, you are solving a task that computers can't perform." So he created a tool, called "recaptcha," that pairs an unknown word with a known one. He distorts them both and puts a line through them--standard techniques for creating captchas. A user must decipher both captchas to access a site. The accurate typing of the known word serves the security purpose of captchas and adds a measure of confidence that the unknown word was identified correctly and can be used in place of the OCR's gibberish. Volunteers have begun deploying recaptchas, and the technique has been used to decipher two million words for the Internet Archive's book digitization effort. Recaptchas tap the joint power of people, networks, and computers in a way that should have a big impact, says Brewster Kahle, an Internet entrepreneur and cofounder of the archive: "It is like an army of ants building the Taj Mahal."
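The pairing logic described above can be sketched roughly as follows. This is a minimal illustration, not the actual recaptcha implementation: the class and function names are hypothetical, and the rule that a transcription is accepted only once at least two users agree on it is an assumption added here to stand in for the article's "measure of confidence."

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Challenge:
    known_word: str       # control word whose correct text is already known
    unknown_word_id: str  # hypothetical ID of a word image the OCR could not read


@dataclass
class TranscriptionStore:
    # Maps an unknown word's ID to a tally of the guesses users have typed.
    votes: dict = field(default_factory=dict)

    def record(self, word_id, guess):
        self.votes.setdefault(word_id, Counter())[guess] += 1

    def best_guess(self, word_id, min_votes=2):
        """Return the leading guess once enough users agree (assumed rule), else None."""
        counts = self.votes.get(word_id)
        if not counts:
            return None
        guess, n = counts.most_common(1)[0]
        return guess if n >= min_votes else None


def check_response(challenge, typed_known, typed_unknown, store):
    """Grant access only if the known word is typed correctly.

    A correct known word is also treated as evidence that the user's
    reading of the unknown word is trustworthy, so that guess is recorded.
    """
    passed = typed_known.strip().lower() == challenge.known_word.lower()
    if passed:
        store.record(challenge.unknown_word_id, typed_unknown.strip().lower())
    return passed
```

In use, a site would call `check_response` for each submission and later ask the store for `best_guess` on each unknown word; a user who mistypes the known word is refused access, and that user's reading of the unknown word is discarded.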



Credit: reCAPTCHA

This image illustrates the difficulty that optical-character-recognition software can have in interpreting the content of older books. Luis von Ahn's recaptcha project is designed to help replace the OCR gibberish with the actual words.


--David Talbot

 
 
TR35 2007 Infotech Winners
Sanjit Biswas
Cheap, easy Internet access
Josh Bongard
Adaptive robots
Garrett Camp
Discovering more of the Web
Mung Chiang
Optimizing networks
Tadayoshi Kohno
Securing systems cryptographically
Tariq Krim
Building a personal, dynamic Web page

Ivan Krstić
Making antivirus software obsolete
Jeff LaPorte
Internet-based calling from mobile phones
Karen Liu
Bringing body language to computer-animated characters
Anna Lysyanskaya
Securing online privacy
Tapan Parikh
Simple, powerful mobile tools for developing economies
Babak Parviz
Self-assembling micromachines
Partha Ranganathan
Power-aware computing systems

Kevin Rose
Online social bookmarking
Marc Sciamanna
Controlling chaos in telecom lasers
Desney Tan
Teaching computers to read minds
Luis von Ahn
Using “captchas” to digitize books
Mark Zuckerberg
Circle of friends

MIT Massachusetts Institute of Technology © 2009 Technology Review. All Rights Reserved.