Researchers use Web registration tool to digitize books

PITTSBURGH (AP) – Researchers at Carnegie Mellon University have discovered a way to enlist people across the globe to help digitize books every time they solve the simple distorted word puzzles commonly used to register at Web sites or buy things online.

The word puzzles are known as CAPTCHAs, short for “completely automated public Turing tests to tell computers and humans apart.” Computers cannot decipher the twisted letters and numbers, ensuring that real people and not automated programs are using the Web sites.

Researchers estimate that about 60 million of those nonsensical jumbles are solved every day around the world, taking an average of about 10 seconds each to decipher and type in.

Instead of wasting time typing in random letters and numbers, Carnegie Mellon researchers have come up with a way for people to type in snippets of books to put their time to good use, confirm they are not machines and help speed up the process of getting searchable texts online.

“Humanity is wasting 150,000 hours every day on these,” said Luis von Ahn, an assistant professor of computer science at Carnegie Mellon. He helped develop CAPTCHAs about seven years ago. “Is there any way in which we can use this human time for something good for humanity, do 10 seconds of useful work for humanity?”
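As a rough check on those figures, the short Python sketch below multiplies the two estimates reported above (about 60 million puzzles a day, about 10 seconds each). The function and constant names are illustrative only and do not come from the researchers.

```python
# Back-of-the-envelope check using the two estimates reported in the article.
SOLVES_PER_DAY = 60_000_000   # "about 60 million" CAPTCHAs solved daily
SECONDS_PER_SOLVE = 10        # "about 10 seconds each"

def hours_spent_per_day(solves=SOLVES_PER_DAY, seconds_each=SECONDS_PER_SOLVE):
    """Total human time spent on CAPTCHAs per day, in hours."""
    return solves * seconds_each / 3600

print(round(hours_spent_per_day()))  # 166667
```

The result, roughly 167,000 hours a day, is in the same range as the 150,000 hours von Ahn cites; both inputs are approximate.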

Many large projects are under way now to digitize books and put them online, and that is mostly being done by scanning pages of books so that people can “page through” the books online. In some cases, optical character recognition, or OCR, is being used to digitize books to make the texts searchable.

But von Ahn said OCR does not always work on text that is older, faded or distorted. In those cases, often the only way to digitize the works is to manually type them into a computer.

Von Ahn is working with the Internet Archive, which runs several book-scanning projects, to use CAPTCHAs for this instead. The Internet Archive scans 12,000 books a month and sends von Ahn hundreds of thousands of image files that the computer doesn’t recognize. Those files are downloaded onto von Ahn’s server and split up into single words that can be used as CAPTCHAs at sites all over the Internet.

If enough users decipher the CAPTCHAs in the same way, the computer will recognize that as the correct answer.
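One way to picture that agreement step is a simple vote count, sketched below in Python. This is only an illustration: the article does not describe the actual reCAPTCHA rule, and the function name `consensus` and the thresholds `min_votes` and `min_share` are assumptions made for the example.

```python
from collections import Counter

def consensus(answers, min_votes=3, min_share=0.6):
    """Return the agreed transcription of one word image, or None.

    min_votes and min_share are hypothetical thresholds standing in
    for "enough users"; the real system's rule is not given in the article.
    """
    # Normalize so that "Morning " and "morning" count as the same answer.
    normalized = [a.strip().lower() for a in answers if a.strip()]
    if not normalized:
        return None
    word, votes = Counter(normalized).most_common(1)[0]
    if votes >= min_votes and votes / len(normalized) >= min_share:
        return word
    return None

print(consensus(["morning", "Morning ", "morning", "mourning"]))  # morning
print(consensus(["upon", "open"]))                                # None
```

In the first call, three of four users agree, so the word is accepted; in the second there are too few matching answers, so the word would stay unresolved and be shown to more users.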

“If we can correct these books so that they are really in good shape, then you can go and use these books in other type devices more easily” such as handheld computers or in programs for reading to the blind, said Brewster Kahle, co-founder of the Internet Archive.

Von Ahn approached the Internet Archive to get help in developing the new system, but it has not been put into use yet. Von Ahn said that, in theory, the new book-based CAPTCHAs could be used in place of any CAPTCHA currently on the Web.

The project, named reCAPTCHA, is one of many projects that enlist computer users from the community to help out. For example, Cloudmark Inc. uses its base of users to judge what is spam and what is not. News aggregation sites like Digg Inc.’s digg.com and Time Warner Inc.’s Netscape.com ask visitors to recommend and vote on items to go on top.

For von Ahn’s project, Intel Corp. donated equipment and the work was sponsored by the MacArthur Foundation, which awarded von Ahn a “genius grant” last year.

Kahle, whose Internet Archive has about 200,000 books currently online, is working with libraries in three countries to digitize their books. Kahle said von Ahn’s project is “harnessing human power in exactly the right way.”

“It’s definitely a barn-raising to try to build the great library,” Kahle said.

——

On the Net:

http://www.recaptcha.net

http://www.gutenberg.org
