Technology Review - Published By MIT

How to Digitize a Million Books

Needed: scanning software for 430 languages and a system to organize the next big leap in the information age.

By Kate Greene

Tuesday, February 28, 2006


Fifteen months after Google announced a book-scanning project of biblical proportions -- an effort to digitize the entire book collections of the New York Public Library and Harvard University libraries, among others -- the company is still secretive about how it is solving key technical problems and won't say how much it has accomplished so far.

However, a similar if smaller project -- the Million Book Project at Carnegie Mellon University in Pittsburgh -- has been underway for about seven years. It could provide some clues. The project's director, computer scientist Raj Reddy, says he and his colleagues have no more knowledge about Google's methods or progress than anyone else, but they are tackling many of the same challenges.

The goal of Google Book Search is to make all offline books -- currently invisible to Google's eye -- searchable. This means physically scanning hundreds of millions of pages bound between the covers of an estimated 18 million books, recognizing around 430 languages and all sorts of fonts, making the results available for text searches, and replicating the traditional library browsing experience when it's all done. Daniel Clancy, engineering director for Google Book Search, says he cannot comment on what the company has accomplished so far.

In the CMU project, though, the scanning technology is off the shelf. The team uses readily available Minolta PS 7000 book scanners set up at 40 scanning stations in India and China, where the local governments are helping to keep the costs low for the nonprofit project. In this setup, workers manually turn each page. Seven years into the project, around 600,000 books (mostly public-domain works shipped from around the world) have been scanned, and every day another 100,000 pages join the digital corpus. At this rate, it could take just under five years to complete the CMU project.
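The "just under five years" estimate can be checked with a little arithmetic. The sketch below is only illustrative: the 600,000 books scanned and the 100,000 pages per day come from the article, but the one-million-book target is inferred from the project's name, and the average page count per book is an assumption the article does not state.

```python
def years_to_finish(total_books, scanned_books, pages_per_day, pages_per_book):
    """Estimate the years left, given an assumed average book length."""
    remaining_pages = (total_books - scanned_books) * pages_per_book
    days = remaining_pages / pages_per_day
    return days / 365

# 1,000,000 is inferred from the project's name; 600,000 and 100,000 are from the article.
# pages_per_book is a guess: the article gives no average book length.
for pages_per_book in (300, 400, 450):
    estimate = years_to_finish(1_000_000, 600_000, 100_000, pages_per_book)
    print(pages_per_book, "pages/book ->", round(estimate, 1), "years")
```

Under these assumptions, the article's figure corresponds to an average of roughly 450 pages per book; shorter books would bring the finish date closer.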

In contrast, Clancy says Google has developed its own scanning technology. But the company is mum about the technical details of the hardware, optical-character recognition (OCR) software, and scanning rate at its five scanning centers near cooperating library partners at Harvard, Stanford, University of Michigan, University of Oxford, and New York Public Library.

Reddy says commercially available software for recognizing English works well for the Million Book Project. The challenges they face with OCR are being addressed by their Chinese partners, who are developing specific software to better recognize unconventional fonts and calligraphic scripts often found in older books. Additionally, their partners in Egypt are developing OCR for Arabic. Right now, Reddy says, OCR is an active research area in which many countries are contributing expertise.

Once books are scanned and their texts accessible, the major challenge is making the text useful for searching. The inconsistent physical quality of books can cause problems, Clancy says, particularly with page numbering. For instance, full pages can be missing, or dog-eared corners could reveal an incorrect page number. And if pagination is wrong in one part of the book, the error propagates throughout the work.

This problem is being overcome, CMU's Reddy says, by designing software that does not rely on page numbers. Instead, it creates "structural metadata," which are basically tags that summarize the meaning of information within a book, so that researchers can link words in the table of contents with corresponding chapters. Additionally, indexed terms can be linked to the correct passages. Unfortunately, says Reddy, establishing the links is still a manual process; no one has developed software that can establish these hyperlinks with more than about 90 percent accuracy. If the technique can be honed, though, it could make text searches more meaningful.
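To make the idea of linking without page numbers concrete, here is a minimal sketch that matches table-of-contents titles to OCR'd chapter headings by text similarity. It is not the Million Book Project's software: the function names, the sample data, and the 0.8 similarity threshold are all hypothetical, and it uses Python's standard difflib module as a stand-in for whatever matching the real system performs.

```python
from difflib import SequenceMatcher

def normalize(text):
    """Lowercase and collapse whitespace so OCR layout noise matters less."""
    return " ".join(text.lower().split())

def link_toc_to_headings(toc_entries, headings, threshold=0.8):
    """Map each TOC title to the scan index of its best-matching heading.

    headings is a list of (scan_index, heading_text) pairs, where scan_index
    is the position of the image in the scan sequence, not a printed page number.
    Titles with no heading above the (assumed) threshold map to None.
    """
    links = {}
    for title in toc_entries:
        best_index, best_score = None, 0.0
        for scan_index, heading in headings:
            score = SequenceMatcher(None, normalize(title), normalize(heading)).ratio()
            if score > best_score:
                best_index, best_score = scan_index, score
        links[title] = best_index if best_score >= threshold else None
    return links

# Hypothetical data: one heading carries a typical OCR error ("modem" for "modern").
toc = ["The Modern Library", "Scanning at Scale", "Appendix of Plates"]
headings = [(12, "THE MODEM LIBRARY"), (57, "Scanning at Scale"), (90, "Index")]
print(link_toc_to_headings(toc, headings))
```

Because the links point at positions in the scan sequence, a missing or misread page number does not shift them. Entries left as None are the cases that would still need manual review, in line with Reddy's point that fully automatic linking is not yet reliable enough.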

Comments

  • Digitization errors
    Google Scholar's OCR of old journals often converts "modern" to "modem". Google is not using any content processing to detect silly uses of "modem".
    Guest (Rdmoore6)
    02/28/2006
    Posts:1
  • Chinese character indexing
    This article is extremely interesting.
    I hope my Chinese Character indexing system will tie in with this kind of on-line book database in the future.
    Guest (w wong)
    02/28/2006
    Posts:1
  • U.S. Government Documents available online
    LexisNexis is currently digitizing millions of pages of declassified documents from all three branches of the U.S. government going back as far as 200 years.  These documents are being made available in searchable PDF documents that have been fully abstracted and indexed.  It's amazing the quantity and quality of information they are publishing online.
    Guest (Doc_Reader)
    02/28/2006
    Posts:1
  • Searching Scanned Books
    The article "How to Digitize a Million Books" discusses the issue of searching through scanned books, but does not touch on the currently popular way of indexing books by their entire text, so that any word or string of characters can be used to search them, rather than keywords alone.
    Guest (Chandra Sekhar S.)
    03/02/2006
    Posts:1
  • wanted a topological transformation compressor
    with the right ratio and a 40gig
    dll all stuff can be stored magazines books music other communication forms
    Guest (lawrephord24@hotmail.com)
    03/05/2006
    Posts:1
  • Google Book Search approach is not so special
    You can find an article about it at the following address: http://students.haverford.edu/jhuttner/Essays/Computers/GoogleBookSearch.htm
    To be more specific, I can tell you that they use specialized book scanners from Kirtas Technologies. Those scanners turn the pages automatically and have a decent scanning speed. One operator can run two or three scanners working in parallel. Very efficient. Kirtas is not alone in this field, but I suppose that is another discussion.
    If you are wondering what OCR they use, it is very simple, and you can use it too: ABBYY FineReader.
    Hope the info is useful.
    Guest (Mpaunescu)
    03/08/2006
    Posts:1
  • Mass digitization with revolutionary technology, without breaking the spine or damaging the book
    We are pioneers in the large-scale mass digitization of books, journals, and periodicals in India, using a special revolutionary technology. The best part is that your originals will not be damaged, and we don't even need to break the spine of the book.

    We have already digitized millions of pages. Our mission is to preserve books and literature for hundreds of generations to come. We also OCR them to make them searchable.

    In this article even Clancy says, "in particular, with page numbering. For instance, full pages can be missing or dog-eared corners could reveal an incorrect page number. And if pagination is wrong in one part of the book, the error propagates throughout the work."

    We have overcome that problem with the technology we are using.

    Rutul Kamdar
    Director
    DigiSys Info Service Pvt. Ltd.
    www.digisysglobe.com
    digisysglobe@gmail.com

    digisysglobe
    08/21/2008
    Posts:1

MIT Massachusetts Institute of Technology © 2010 Technology Review. All Rights Reserved.