How to Digitize a Million Books

Needed: scanning software for 430 languages and a system to organize the next big leap in the information age.

Kate Greene

February 28, 2006

Fifteen months after Google announced a book-scanning project of biblical proportions – an effort to digitize the entire book collections of the New York Public Library and Harvard University libraries, among others – the company is still secretive about how it is solving key technical problems and won’t say how much it has accomplished so far.

However, a similar if smaller project – the Million Book Project at Carnegie Mellon University in Pittsburgh – has been underway for about seven years. It could provide some clues. The project’s director, computer scientist Raj Reddy, says he and his colleagues have no more knowledge about Google’s methods or progress than anyone else, but they are tackling many of the same challenges.

The goal of Google Book Search is to make all offline books – currently invisible to Google’s eye – searchable. This means physically scanning hundreds of millions of pages bound between the covers of an estimated 18 million books, recognizing around 430 languages and all sorts of fonts, making the results available for text searches, and replicating the traditional library browsing experience when it’s all done. Daniel Clancy, engineering director for Google Book Search, says he cannot comment on what the company has accomplished so far.

In the CMU project, though, the scanning technology is off the shelf. They’re using readily available Minolta PS 7000 book scanners set up at 40 scanning stations in India and China, where the local governments are helping to keep the costs low for the nonprofit project. In this setup, workers manually turn each page. Seven years into the project, around 600,000 books (mostly public-domain works shipped from around the world) have been scanned, and every day another 100,000 pages join the digital corpus. At this rate, it could take just under five years to complete the CMU project.
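The five-year estimate depends on an average book length that is not stated here. As a rough check, the sketch below works backward from the figures above (600,000 of one million books scanned, 100,000 pages a day) to the pages per book the estimate implies. The 365-day scanning year and the function names are assumptions made for illustration, not details from the project.

```python
# Figures quoted in the article.
TARGET_BOOKS = 1_000_000
SCANNED_BOOKS = 600_000
PAGES_PER_DAY = 100_000

# Assumption: scanning runs every day of the year.
DAYS_PER_YEAR = 365


def years_to_finish(avg_pages_per_book):
    """Years needed to scan the remaining books at the current daily rate."""
    remaining_pages = (TARGET_BOOKS - SCANNED_BOOKS) * avg_pages_per_book
    return remaining_pages / PAGES_PER_DAY / DAYS_PER_YEAR


def implied_pages_per_book(years):
    """Average book length that would make the remaining work take `years`."""
    remaining_books = TARGET_BOOKS - SCANNED_BOOKS
    return years * DAYS_PER_YEAR * PAGES_PER_DAY / remaining_books
```

Under these assumptions, `implied_pages_per_book(5)` comes to about 456 pages, so "just under five years" corresponds to an average book slightly shorter than that. A shorter average, or a faster scanning rate, would shrink the estimate proportionally.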

In contrast, Clancy says Google has developed its own scanning technology. But the company is mum about the technical details of the hardware, optical-character recognition (OCR) software, and scanning rate at its five scanning centers near cooperating library partners at Harvard, Stanford, the University of Michigan, the University of Oxford, and the New York Public Library.

Reddy says commercially available software for recognizing English works well for the Million Book Project. The challenges they face with OCR are being addressed by their Chinese partners, who are developing specific software to better recognize unconventional fonts and calligraphic scripts often found in older books. Additionally, their partners in Egypt are developing OCR for Arabic. Right now, Reddy says, OCR is an active research area in which many countries are contributing expertise.

Once books are scanned and their texts accessible, the major challenge is making the text useful for searching. Inconsistency in the physical quality of books can cause problems, Clancy says, particularly with page numbering. For instance, full pages can be missing, or a dog-eared corner could reveal an incorrect page number. And if pagination is wrong in one part of the book, the error propagates throughout the work.

This problem is being overcome, CMU’s Reddy says, by designing software that does not rely on page numbers. Instead, it creates “structural metadata,” which are basically tags that summarize the meaning of information within a book, so that researchers can link words in the table of contents with corresponding chapters. Additionally, indexed terms can be linked to the correct passages. Unfortunately, says Reddy, establishing the links is still a manual process; no one has developed software that can establish these hyperlinks with more than about 90 percent accuracy. If the technique can be honed, though, it could make text searches more meaningful.
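To make the idea concrete, here is a minimal sketch of linking table-of-contents entries to chapter headings by comparing their text rather than their page numbers. It is not the CMU software: the function names, the 0.8 similarity threshold, and the use of Python's standard `difflib` matcher are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase the text and replace punctuation with spaces."""
    cleaned = "".join(c if c.isalnum() else " " for c in text.lower())
    return " ".join(cleaned.split())


def link_toc_to_headings(toc_entries, headings, threshold=0.8):
    """Map each contents entry to the index of its best-matching heading.

    An entry maps to None when no heading is similar enough, which is
    where a human reviewer would still have to step in.
    """
    links = {}
    for entry in toc_entries:
        best_index, best_score = None, 0.0
        for i, heading in enumerate(headings):
            score = SequenceMatcher(
                None, normalize(entry), normalize(heading)
            ).ratio()
            if score > best_score:
                best_index, best_score = i, score
        links[entry] = best_index if best_score >= threshold else None
    return links
```

Given contents entries such as "The Window" and "Time Passes" and scanned headings such as "THE W1NDOW" and "TIME PASSES." (the stray digit and period imitate OCR noise), the sketch links each entry to the right heading, while an entry with no counterpart in the scan is left unlinked. A fixed threshold like this trades missed links against wrong ones, which is one reason a manual pass remains necessary.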

Ultimately, Clancy says, Google would like Book Search to give the same result as someone going to a library, looking in its stacks, and serendipitously finding a book that’s interesting or useful. One way to do this would be to link books to each other by categories and themes, he suggests. The task becomes more complicated, though, when linking works by Virginia Woolf, for instance, to criticisms of her work, works that inspired her, or authors who wrote during the same era. Designing algorithms that can effectively organize all of this new information, Clancy says, is “one of the grand challenges and will take many years.”

Reddy says CMU researchers are trying to tackle this challenge by using a “statistical approach” to organizing the information. In this approach, Virginia Woolf’s stream-of-consciousness sentences, for example, would be analyzed by an algorithm that would find patterns based on sentence length, structure, and punctuation. This technique might find a work by James Joyce, one of Woolf’s influences – or that of an obscure author whose writings might otherwise never have been found.
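A toy version of such a statistical comparison might look like the sketch below. It is an illustration only, not the CMU researchers' algorithm: the three features (words, commas, and semicolons per sentence), the rescaling step, and the nearest-neighbor lookup are assumptions, and the code expects each text to contain at least a few words.

```python
import math
import re
from statistics import mean, pstdev


def style_features(text):
    """Words, commas, and semicolons per sentence for one text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n = max(len(sentences), 1)  # guard against empty input
    return [len(text.split()) / n, text.count(",") / n, text.count(";") / n]


def standardize(rows):
    """Rescale each feature column to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stdevs = [pstdev(col) or 1.0 for col in columns]  # constant column: leave unscaled
    return [[(x - m) / s for x, m, s in zip(row, means, stdevs)] for row in rows]


def most_similar(query_title, corpus):
    """Return the title in `corpus` (title -> text) closest in style to the query."""
    titles = list(corpus)
    scaled = standardize([style_features(corpus[t]) for t in titles])
    vectors = dict(zip(titles, scaled))
    others = [t for t in titles if t != query_title]
    return min(others, key=lambda t: math.dist(vectors[query_title], vectors[t]))
```

With a corpus of two long, comma-laden passages and one made of short declarative sentences, `most_similar` pairs the two long-sentence texts. Rescaling matters here because sentence length is measured in tens of words while punctuation counts are small, and without it the first feature would swamp the others.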

In the meantime, researchers are seeking shortcuts for searching among authors, books, and genres, Reddy says. Similar to the way “collaborative filtering” at Amazon uses people’s past purchases to help others find potential purchases, Book Search users could help each other. The community-based approach is an idea that Google has not announced, Clancy says, but it could add another layer to searching through books and create grassroots excitement about the project.
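In its simplest form, collaborative filtering just counts which titles turn up together in different readers' histories. The sketch below shows that co-occurrence idea only; it is not Amazon's system or anything Google has described, and the function name and data layout are assumptions for illustration.

```python
from collections import Counter


def also_read(book, reading_histories, top_n=3):
    """Titles that most often share a reading history with `book`.

    `reading_histories` is a list of lists, one list of titles per reader.
    """
    counts = Counter()
    for history in reading_histories:
        titles = set(history)  # count each title once per reader
        if book in titles:
            counts.update(titles - {book})
    return [title for title, _ in counts.most_common(top_n)]
```

If three readers opened both "A" and "B" and two of them also opened "C", then `also_read("A", histories)` suggests "B" first and "C" second. Real systems also correct for very popular titles, which would otherwise show up in every recommendation.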

Certainly, holding its cards close is not new for Google. “They are secretive about almost everything they do…this is very common with Silicon Valley companies,” Reddy at CMU says. In the case of Book Search, he says, Google wants “to have a captive solution for all the libraries.” Even so, Reddy is excited about Google’s project and believes it will eventually complement his research. “I’m sure at some point they will have a pointer to our books,” he says.

Meanwhile, Google has to contend with the nontechnical issue of disgruntled copyright holders dragging it into court. The Authors Guild and a number of publishers have sued the company, claiming that Google’s project violates copyright law. (Stanford professor Lawrence Lessig has made a 30-minute video about the legal controversy.)

But if the legal and technical challenges can be overcome, digitized physical books could greatly surpass the billions of existing Web pages in breadth and depth of information. Indeed, a single comprehensive online card catalog for millions of the world’s books has the potential to create a whole new chapter in the information age.
