Breaching the Walls
Even for authorized users, access to the Bodleian Library’s seven million volumes is anything but instant. If you are an Oxford undergraduate in need of a book, you first send an electronic request to a worker in the library’s underground stacks. (Before 2000 or so, you would have handed a written request slip to a librarian, who would have relayed it to the stacks via a 1940s-era network of pneumatic tubes.) The worker locates the book in a warren of movable shelves (a space-saving innovation conceived in 1898 by former British prime minister William Gladstone) and places it in a plastic bin. An ingenious system of conveyor belts and elevators, also built in the 1940s, carries the bin back to any of seven reading rooms, where it is unpacked, and the book is handed over to you.
The process can take anywhere from 30 minutes to several hours. But once you finally have the book, don’t even think about taking it back to your dorm room for further study. The Bodleian is a noncirculating library, so its books must be read on the premises. It is also a legal deposit library, meaning that it is entitled to a free copy of every book published in the United Kingdom and the Republic of Ireland, and it guards those copies jealously. The library takes in tens of thousands of books every year, but legend has it that no book has ever left its walls.
But a digital book needn’t be loaned out to be shared. And Oxford’s various libraries have already created digital images of many of their greatest treasures, from ninth-century illuminated Latin manuscripts to 19th-century children’s alphabet books. Most of these images can be examined at high resolution on the Web. The only catch is that scholars have to know what they’re looking for in advance, since very few of the digital pages are searchable. Optical character recognition (OCR) technology cannot yet interpret handwritten script, so exposing the content of these books to today’s search engines requires typing their texts into separate files linked to the original images. A three-person team at Oxford, in collaboration with librarians at the University of Michigan and 70 other universities, is doing just that for a large collection of early English books, but the entire effort produces searchable text for only 200 books per month. At that rate, making a million books searchable would take more than 400 years.
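The 400-year figure follows directly from the transcription rate. As a quick check, here is a minimal sketch that assumes the rate of 200 books per month stays constant; the function name is illustrative only and is not part of any project described here.

```python
def years_to_transcribe(total_books: int, books_per_month: int) -> float:
    """Years needed to key in total_books at a constant monthly rate."""
    months = total_books / books_per_month
    return months / 12


# One million books at 200 per month is 5,000 months of work.
years = years_to_transcribe(1_000_000, 200)
print(f"{years:.1f} years")  # a little under 417 years
```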
That’s where Google’s resources will make a difference. Susan Wojcicki, a product manager at Google’s Mountain View, CA, campus and leader of the Google Print project, puts it bluntly: “At Google we’re good at doing things at scale.”
Google has already copied and indexed some eight billion Web pages, which lends credibility to its claim that it can digitize a big chunk of the 60 million volumes (counting duplicates) held by Harvard, Oxford, Stanford, the University of Michigan, and the New York Public Library in a matter of years. It will be a complex task, but one that is in some ways familiar for the company. “It’s not just feeding the books into some kind of digitization machine, but then actually taking the digital files, moving those files around, storing them, compressing them, OCR-ing them, indexing them, and serving them up,” points out Wojcicki. “At that point it becomes similar to all of Google’s other businesses, where we’re managing large amounts of data.” But the entire project, Wojcicki admits, hinges on those digitization machines: a fleet of proprietary robotic cameras, still under development, that will turn the digitization of printed books into a true assembly-line process and, in theory, lower the cost to about $10 per book, compared to a minimum of $30 per book today.
Neither Google nor its partner libraries have announced exactly how the process will work. But John Wilkin, associate university librarian at the University of Michigan, says it will go something like this: “We put a whole shelfful of books onto a cart, keeping the order intact. We check them out by waving them under a bar code reader. Overnight, software takes all the bar codes, extracts machine-readable records from the university’s electronic catalogue, and sends the records to Google, so they can match them with the books. Then we move the cart into Google’s operations room.”
This room will contain multiple workstations so that several books can be digitized in parallel. Google is designing the machines to minimize the impact on books, according to Wilkin. “They scan the books in order and return the cart to us,” he continues. “We check them back in and mark the records to show they’ve been scanned. Finally, the digital files are shipped in a raw format to a Google data center and processed to produce something you could use.”
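The cart workflow Wilkin outlines can be pictured as a simple bookkeeping loop. The sketch below is only an illustration of his description, not the software Google or the libraries use: the `CatalogRecord` fields, the function names, and the sample bar codes and titles are all hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class CatalogRecord:
    # Hypothetical fields; a real library catalogue record holds far more.
    barcode: str
    title: str
    checked_out: bool = False
    scanned: bool = False


def check_out_cart(barcodes, catalog):
    """Check out each book on the cart and return its records in shelf order."""
    records = []
    for code in barcodes:  # the cart keeps the shelf order intact
        record = catalog[code]
        record.checked_out = True
        records.append(record)
    return records


def check_in_cart(records):
    """Check the returned books back in and mark them as scanned."""
    for record in records:
        record.checked_out = False
        record.scanned = True


# Placeholder catalogue and cart, invented for this example.
catalog = {
    "B001": CatalogRecord("B001", "Example Title One"),
    "B002": CatalogRecord("B002", "Example Title Two"),
}
cart = ["B002", "B001"]

manifest = check_out_cart(cart, catalog)  # records sent ahead to match the books
check_in_cart(manifest)                   # after the cart comes back from scanning
print([(r.barcode, r.scanned) for r in manifest])
```

Keeping the records in the same order as the books on the cart is what lets the scanned files be matched to catalogue entries without anyone retyping titles.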