Technology Review - Published By MIT

How to Digitize a Million Books


By Kate Greene

Tuesday, February 28, 2006


Ultimately, Clancy says, Google would like Book Search to give the same result as someone going to a library, looking in its stacks, and serendipitously finding a book that's interesting or useful. One way to do this would be to link books to each other by categories and themes, he suggests. The task becomes more complicated, though, when linking works by Virginia Woolf, for instance, to criticisms of her work, works that inspired her, or authors who wrote during the same era. Designing algorithms that can effectively organize all of this new information, Clancy says, is "one of the grand challenges and will take many years."

Reddy says CMU researchers are trying to tackle this challenge by using a "statistical approach" to organizing the information. In this approach, Virginia Woolf's stream-of-consciousness sentences, for example, would be analyzed by an algorithm that would find patterns based on sentence length, structure, and punctuation. This technique might turn up a work by James Joyce, one of Woolf's influences, or one by an obscure author whose writings might otherwise never have been found.
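
As a rough illustration of what a statistical comparison of writing style could look like, here is a minimal sketch. It is not CMU's method, which the article does not describe in detail. The function names, the choice of punctuation marks, and the use of a Canberra distance (which compares each feature on its own scale) are all assumptions made for the example.

```python
import re

# Punctuation marks whose frequency we treat as a style signal (an assumption).
PUNCTUATION = ",;:-!?"

def style_features(text):
    """Return [mean sentence length in words] followed by the number of
    occurrences per word of each mark in PUNCTUATION."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return [0.0] * (1 + len(PUNCTUATION))
    mean_length = len(words) / len(sentences)
    rates = [text.count(mark) / len(words) for mark in PUNCTUATION]
    return [mean_length] + rates

def style_distance(a, b):
    """Canberra distance between two feature vectors: each feature adds
    |x - y| / (|x| + |y|), so large-valued features do not dominate.
    Features that are zero in both vectors add nothing."""
    total = 0.0
    for x, y in zip(a, b):
        denominator = abs(x) + abs(y)
        if denominator:
            total += abs(x - y) / denominator
    return total
```

A smaller distance suggests a closer stylistic match under these features. A real system would need far richer features (syntax, vocabulary, rhythm) and much longer text samples than a sentence or two.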

In the meantime, researchers are seeking shortcuts for searching among authors, books, and genres, Reddy says. Similar to the way "collaborative filtering" at Amazon uses people's past purchases to help others find potential purchases, Book Search users could help each other. The community-based approach is an idea that Google has not announced, Clancy says, but it could add another layer to searching through books and create grassroots excitement about the project.
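
The idea of readers helping each other can be sketched with simple co-occurrence counting: books that often appear in the same reading histories are suggested together. This is only an illustration, not Amazon's algorithm or anything Google has announced. The function name, the reader IDs, and the sample histories are hypothetical.

```python
from collections import Counter

def also_read(histories, book, top_n=3):
    """Rank other books by how many reading histories they share with `book`.

    `histories` maps a reader ID to the list of titles that reader has viewed.
    """
    counts = Counter()
    for titles in histories.values():
        if book in titles:
            # Count each co-occurring title once per reader.
            counts.update(t for t in set(titles) if t != book)
    # Most shared first; ties broken alphabetically for stable output.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [title for title, _ in ranked[:top_n]]

# Hypothetical sample data.
histories = {
    "reader1": ["Mrs Dalloway", "Ulysses"],
    "reader2": ["Mrs Dalloway", "Ulysses", "Orlando"],
    "reader3": ["Orlando", "Dubliners"],
}
suggestions = also_read(histories, "Mrs Dalloway")
```

With this sample data, "Ulysses" ranks ahead of "Orlando" because two readers of "Mrs Dalloway" also viewed it.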

Certainly, holding its cards close is not new for Google. "They are secretive about almost everything they do...this is very common with Silicon Valley companies," Reddy at CMU says. In the case of Book Search, he says, Google wants "to have a captive solution for all the libraries." Even so, Reddy is excited about Google's project and believes it will eventually complement his research. "I'm sure at some point they will have a pointer to our books," he says.

Meanwhile, Google has to contend with the nontechnical issue of disgruntled copyright holders dragging it into court. The Authors Guild and a number of publishers have sued the company, claiming that Google's project violates copyright law. (Stanford professor Lawrence Lessig has made a 30-minute video about the legal controversy.)

But if the legal and technical challenges can be overcome, digitized physical books could greatly surpass the billions of existing Web pages in breadth and depth of information. Indeed, a single comprehensive online card catalog for millions of the world's books has the potential to create a whole new chapter in the information age.

Comments

  • Digitization errors
    Google Scholar's OCR of old journals has a problem: it often converts "modern" to "modem". Google is not using any content processing to detect silly uses of "modem".
    Guest (Rdmoore6)
    02/28/2006
    Posts:1
  • Chinese character indexing
    This article is extremely interesting.
    I hope my Chinese Character indexing system will tie in with this kind of on-line book database in the future.
    Guest (w wong)
    02/28/2006
    Posts:1
  • U.S. Government Documents available online
    LexisNexis is currently digitizing millions of pages of declassified documents from all three branches of the U.S. government going back as far as 200 years.  These documents are being made available in searchable PDF documents that have been fully abstracted and indexed.  It's amazing the quantity and quality of information they are publishing online.
    Guest (Doc_Reader)
    02/28/2006
    Posts:1
  • Searching Scanned Books
    The article "How to Digitize a Million Books" discusses the issue of searching through the scanned books, but does not touch on the currently popular way of indexing books by their "entire text", so that any word or string of characters can be used to search books, instead of keyword searches.
    Guest (Chandra Sekhar S.)
    03/02/2006
    Posts:1
  • Wanted: a topological transformation compressor
    With the right ratio and a 40gig dll, all stuff can be stored: magazines, books, music, other communication forms.
    Guest (lawrephord24@hotmail.com)
    03/05/2006
    Posts:1
  • Google Book Search approach is not so special
    You can find an article about it at the following address: http://students.haverford.edu/jhuttner/Essays/Computers/GoogleBookSearch.htm
    To be more specific, I can tell you that they use some specialized book scanners from Kirtas Technologies. Those scanners turn the pages automatically and have a decent scanning speed. You need only one operator for two or three scanners working in parallel. Very efficient. Kirtas is not alone in this field, but I suppose that is another discussion.
    If you are wondering what OCR they use, it is very simple, and you can use it too: it is ABBYY FineReader.
    Hope the info is useful
    Guest (Mpaunescu)
    03/08/2006
    Posts:1
  • Mass digitization with revolutionary technology, without breaking the spine or damaging the book
    We are pioneers in the large-scale digitization of books, journals, and periodicals in India, using a special revolutionary technology. Best of all, your originals will not be damaged, and we do not even need to break the spine of the book.

    We have already digitized millions of pages. Our mission is to preserve books and literature for hundreds of generations to come. We also OCR them to make them searchable.

    In this article, even Clancy says, "in particular, with page numbering. For instance, full pages can be missing or dog-eared corners could reveal an incorrect page number. And if pagination is wrong in one part of the book, the error propagates throughout the work."

    We have overcome that problem with the technology we are using.

    Rutul Kamdar
    Director
    DigiSys Info Service Pvt. Ltd.
    www.digisysglobe.com
    digisysglobe@gmail.com

    digisysglobe
    08/21/2008
    Posts:1

MIT Massachusetts Institute of Technology © 2009 Technology Review. All Rights Reserved.