The Library of Utopia

Google’s ambitious book-scanning program is foundering in the courts. Now a Harvard-led group is launching its own sweeping effort to put our literary heritage online. Will the Ivy League succeed where Silicon Valley failed?

Nicholas Carr

April 25, 2012

In his 1938 book World Brain, H.G. Wells imagined a time—not very distant, he believed—when every person on the planet would have easy access to “all that is thought or known.”

The 1930s were a decade of rapid advances in microphotography, and Wells assumed that microfilm would be the technology to make the corpus of human knowledge universally available. “The time is close at hand,” he wrote, “when any student, in any part of the world, will be able to sit with his projector in his own study at his or her convenience to examine any book, any document, in an exact replica.”

Wells’s optimism was misplaced. The Second World War put idealistic ventures on hold, and after peace was restored, technical constraints made his plan unworkable. Though microfilm would remain an important medium for storing and preserving documents, it proved too unwieldy, too fragile, and too expensive to serve as the basis for a broad system of knowledge transmission. But Wells’s idea is still alive. Today, 75 years later, the prospect of creating a public repository of every book ever published—what the Princeton philosopher Peter Singer calls “the library of utopia”—seems well within our grasp. With the Internet, we have an information system that can store and transmit documents efficiently and cheaply, delivering them on demand to anyone with a computer or a smart phone. All that remains to be done is to digitize the more than 100 million books that have appeared since Gutenberg invented movable type, index their contents, add some descriptive metadata, and put them online with tools for viewing and searching.

Google had the smarts and the money to scan millions of books into its database, but the major problem with constructing a universal library has little to do with technology.

It sounds straightforward. And if it were just a matter of moving bits and bytes around, a universal online library might already exist. Google, after all, has been working on the challenge for 10 years. But the search giant’s book program has foundered; it is mired in a legal swamp. Now another momentous project to build a universal library is taking shape. It springs not from Silicon Valley but from Harvard University. The Digital Public Library of America—the DPLA—has big goals, big names, and big contributors. And yet for all the project’s strengths, its success is far from assured. Like Google before it, the DPLA is learning that the major problem with constructing a universal library nowadays has little to do with technology. It’s the thorny tangle of legal, commercial, and political issues that surrounds the publishing business. Internet or not, the world may still not be ready for the library of utopia.

GOOGLE’S TRAVAILS

Larry Page isn’t known for his literary sensibility, but he does like to think big. In 2002, the Google cofounder decided that it was time for his young company to scan all the world’s books into its database. If printed texts weren’t brought online, he feared, Google would never fulfill its mission of making the world’s information “universally accessible and useful.” After doing some book-scanning tests in his office—he manned the camera while Marissa Mayer, then a product manager, turned pages to the beat of a metronome—he concluded that Google had the smarts and the money to get the job done. He set a team of engineers and programmers to work. In a matter of months, they had invented an ingenious scanning device that used a stereoscopic infrared camera to correct for the bowing of pages that occurs when a book is opened. The new scanner made it possible to digitize books rapidly without cutting off their spines or otherwise damaging them. The team also wrote character recognition software that could decipher unusual fonts and other textual oddities in more than 400 languages.

In 2004, Page and his colleagues went public with their project, which they would later name Google Book Search—a reminder that the company, at least originally, thought of the service essentially as an extension of its search engine. Five of the world’s largest research libraries, including the New York Public Library and the libraries of Oxford and Harvard, signed on as partners. They agreed to let Google digitize books from their collections in return for copies of the images. The company went on a scanning binge, making digital replicas of millions of volumes. It didn’t always restrict itself to books in the public domain; it scanned ones still under copyright, too. That’s when the trouble started. The Authors Guild and the Association of American Publishers sued Google, claiming that copying entire books, even with the intent of showing only a few lines of text in search results, constituted “massive” copyright infringement.

Google then made a fateful choice. Instead of going to trial and defending Book Search on grounds that it amounted to “fair use” of copyright-protected material—a case that some legal scholars believe it might have won—it negotiated a sweeping settlement with its adversaries. In 2008, the company agreed to pay large sums to authors and publishers in return for permission to develop a commercial database of books. Under the terms of the deal, Google would be able to sell subscriptions to the database to libraries and other institutions while also using the service as a means for selling e-books and displaying advertisements.

That only deepened the controversy. Librarians and academics lined up to oppose the deal. Many authors asked that their works be exempted from it. The U.S. Justice Department raised antitrust concerns. Foreign publishers howled. Last year, after a final round of legal maneuvering, federal district judge Denny Chin rejected the settlement, saying it “would simply go too far.” Listing a variety of objections, he argued that the pact would not only “grant Google significant rights to exploit entire books, without permission of the copyright owners,” but also reward the company for its “wholesale copying of copyrighted works” in the past. The company now finds itself nearly back at square one, with the original lawsuits slated to go to trial this summer. Facing new competitive threats from Facebook and other social networks, Google may no longer see Book Search as a priority. A decade after it began, Page’s bold project has stalled.

SEEKING ENLIGHTENMENT

If you were looking for Larry Page’s opposite, you would be hard pressed to find a better candidate than Robert Darnton. A distinguished historian and prize-winning author, a former Rhodes scholar and MacArthur fellow, a Chevalier in France’s Légion d’Honneur, and a 2011 recipient of the National Humanities Medal, the 72-year-old Darnton is everything that Page is not: eloquent, diplomatic, and embedded in the literary establishment. If Page is a bull in a china shop, Darnton is the china shop’s proprietor.

Robert Darnton has written that he wants to open up “nearly everything available in the walled-in repositories of human culture.”

But Darnton has one thing in common with Page: an ardent desire to see a universal library established online, a library that would, as he puts it, “make all knowledge available to all citizens.” In the 1990s he initiated two groundbreaking projects to digitize scholarly and historical works, and by the end of the decade he was writing erudite essays about the possibilities of electronic books and digital scholarship. In 2007 he was recruited to Harvard and named the director of its library system, giving him a prominent perch for promoting his dream. Although Harvard was one of the original partners in Google’s scanning scheme, Darnton soon became the most eminent and influential critic of the Book Search settlement, writing articles and giving lectures in opposition to the deal. His criticism was as withering as it was learned. Google Book Search, he maintained, was “a commercial speculation” that, under the liberal terms of the settlement, seemed fated to grow into “a hegemonic, financially unbeatable, technologically unassailable, and legally invulnerable enterprise that can crush all competition.” It would become “a monopoly of a new kind, not of railroads or steel, but of access to information.”

Darnton’s rhetoric seemed overwrought to some. University of Michigan librarian Paul Courant accused him of spreading “a dystopian fantasy.” But Darnton had cause to be concerned. Over the years, he had watched commercial publishers relentlessly ratchet up subscription prices for scholarly journals. Annual renewal fees had soared into the thousands of dollars for many periodicals, squeezing the budgets of research libraries. Darnton feared that Google, operating under the broad commercial protections granted by the settlement, would have the power to charge whatever it wanted for subscriptions to its database. Libraries might end up paying exorbitant sums to gain access to the very volumes they had let Google scan for free. The company’s executives, Darnton acknowledged, seemed to be filled with idealism and goodwill, but there was no guarantee that they, or their successors, would not become profit-hungry predators in the future. By allowing “the commercialization of the content of our libraries,” he argued, the agreement “would turn the Internet into an instrument for privatizing knowledge that belongs in the public sphere.”

If libraries and universities worked together, Darnton argued, with funding from charitable foundations, they could build a true digital public library of America. Darnton’s inspiration for the DPLA came not from today’s technologists but from the great philosophers of the Enlightenment. As ideas circulated through Europe and across the Atlantic during the 18th century, propelled by the technologies of the printing press and the post office, thinkers like Voltaire, Rousseau, and Thomas Jefferson came to see themselves as citizens of a Republic of Letters, a freethinking meritocracy that transcended national borders. It was a time of great intellectual fervor and ferment, but the Republic of Letters was “democratic only in principle,” Darnton pointed out in an essay in the New York Review of Books: “In practice, it was dominated by the wellborn and the rich.”

With the Internet, we could at last rectify that inequity. By putting digital copies of works online, Darnton has argued, we could open the collections of the country’s great libraries to anyone with access to the network. We could create a “Digital Republic of Letters” that would be truly free and open and democratic. The DPLA would allow us to “realize the Enlightenment ideals on which our country was founded.”

“TO BE DETERMINED”

Harvard’s Berkman Center for Internet and Society eagerly accepted Darnton’s challenge. It announced late in 2010 that it would coördinate an effort to build the DPLA and turn the Enlightenment dream into an Information Age reality. The project garnered seed money from the Alfred P. Sloan Foundation and attracted a steering committee of luminaries, among them Darnton and Courant as well as the chief librarian of Stanford University, Michael Keller, and the founder of the Internet Archive, Brewster Kahle. Named to chair the committee was John Palfrey, a young Harvard law professor and coauthor of influential books about the Internet. (Palfrey plans to leave Harvard on July 1 to become headmaster of Phillips Academy Andover, the Massachusetts prep school, but he says he will remain at the helm of the DPLA.)

University of Michigan librarian Paul Courant, now on the DPLA steering committee, saw benefits for the public in Google’s plan.

The Berkman Center set an ambitious goal of having the digital library begin operating, at least in some rudimentary form, by April of 2013. Over the past year and a half, the project has moved quickly on several fronts. It has held public meetings to promote the library, solicit ideas, and recruit volunteers. It has organized six working groups to wrestle with various challenges, from defining its audience to resolving technical issues. And it has conducted an open “beta sprint” competition to gather innovative operating concepts and useful software from a wide range of organizations and individuals.

When Judge Chin scuttled the Google deal last year, Darnton got a historic opportunity to cast the DPLA as the world’s best chance for a universal digital library. And indeed, it has gained broad support. Its plans have been praised by, among others, the Archivist of the United States, David Ferriero, and it has forged important partnerships, including one with Europeana, a European Commission–sponsored digital library with a similar concept.

However, the DPLA’s decision to call itself a “public library” has raised hackles. At a meeting in May of last year, a group called the Chief Officers of State Library Agencies passed a resolution asking the DPLA steering committee to change the name of the project. While the state librarians expressed support for an effort to “make the cultural and scientific heritage of our country and the world freely available to all,” they worried that by presenting itself as the country’s public library, the DPLA could lend credence to “the unfounded belief that public libraries can be replaced in over 16,000 communities in the U.S. by a national digital library.” Such a perception would make it even harder for local libraries to protect their budgets from cuts. Other critics have seen arrogance in the DPLA’s assumption that a single online library can support the very different needs of scholarly researchers and the public. To strengthen its ties to public libraries, the DPLA added five public librarians to its steering committee last year, including Boston Public Library president Amy Ryan and San Francisco city librarian Luis Herrera.

The controversy over nomenclature points to a deeper problem confronting the nascent online library: its inability to define itself. The DPLA remains a mystery in many ways. No one knows precisely how it will operate or even what it will be. Some of the vagueness is deliberate. When the Berkman Center launched the initiative, it wanted major decisions to be made in a collaborative and inclusive manner, avoiding top-down decrees that might alienate any of its many constituencies. But according to current DPLA officials and others involved in the project, the 17 members of the steering committee also have fundamental disagreements about the library’s mission and scope. Many important aspects of the effort remain, in Palfrey’s words, “to be determined.”

No consensus has been reached, for example, on the extent to which the DPLA will host digitized books on its own servers, as opposed to providing pointers to digital collections stored on the computers of other libraries and archives. Nor has the steering committee made a firm decision about which materials other than books will be included in the library. Photographs, motion pictures, audio recordings, images of objects, and even blog posts and online videos are all under consideration. Another open question, one with particularly far-reaching implications, is whether the DPLA will try to provide any sort of access to recently published books, including popular e-books. Darnton, for his part, believes that the digital library should steer clear of works published in the last five or 10 years, to avoid treading on the turf of publishers and public libraries. It would be a mistake, he warns, for the DPLA to “invade the current commercial market.” But while he says he has yet to hear anyone make a convincing counterargument, he admits that his view may not be held by everyone. Palfrey will only say that the DPLA is studying the issue of e-book lending but has yet to decide whether its scope will extend to recent publications.

Also unsettled is the critical question of how the DPLA will present itself to the public. David Weinberger, a Berkman researcher who is overseeing the development of the library’s technical platform, says that no decision has been reached on whether the DPLA will offer a “front-end interface,” such as a website or a smart-phone app, or whether it will restrict itself to being a behind-the-scenes data clearinghouse that other organizations can tap into. The technology team’s immediate goals are relatively modest. First the group wants to establish a flexible, open-source protocol for importing catalogue information and other data (such as records of how often books were borrowed) from participating institutions. Then it aims to organize that metadata into a unified database. And next it wants to provide an open programming interface for the database, with the hope of inspiring creative programmers to develop useful applications. Palfrey says that he expects the DPLA to operate its own public website, but he is wary of making any predictions about the functions of that site or the degree to which it may overlap with the online offerings of traditional libraries. While he hopes that the DPLA will be more than a “metadata repository,” he also says he would consider the effort a success even if it ultimately provided just the “plumbing” required to connect diverse and far-flung collections of materials.

Early copyright legislation guaranteed that no book would remain under private control for very long. Most works immediately entered the public domain.

It’s hardly surprising that a large and diverse steering committee would have difficulty reaching unanimity on complicated and weighty matters. And it’s understandable that the DPLA’s leaders would be nervous about making concrete decisions that would almost certainly upset some people in the library profession and the publishing business. But there’s growing tension between the heroic self-portrait that the DPLA presents to the public—its website proclaims that it “will make the cultural and scientific heritage of humanity available, free of charge, to all”—and the tentativeness and equivocation that cloud what is actually being built. If the uncertainties about the DPLA’s identity and workings aren’t cleared up, they could end up delaying or even waylaying the project.

THE COPYRIGHT WALL

Even if the views of the steering committee members were to come into harmony tomorrow, the ultimate form of the DPLA would remain hazy. The biggest question hanging over the project is one that can’t be decided by executive fiat, or even by methodical consensus building. It’s the same question that confronted Google Book Search and that bedevils every other effort to create an expansive online library: how do you navigate the country’s onerous copyright restrictions? “The legal problems are staggering,” Darnton says.

The U.S. Congress passed the first federal copyright law in 1790. Following English precedent, lawmakers sought to strike a reasonable balance between the desire of writers to earn a living and the benefit to society of giving people free access to the ideas of others. The law allowed “Authors and Proprietors” of “Maps, Charts and Books” to register a copyright in their work for 14 years and, if they were still alive at the end of that term, to renew the copyright for another 14 years. By limiting copy protections to a maximum of 28 years, the legislators guaranteed that no book would remain under private control for very long. And by requiring that copyrights be formally registered, they ensured that most works would immediately enter the public domain. Of the 13,000 books published in the country during the decade following the law’s enactment, fewer than 600 were registered for copyright, according to historian John Tebbel.

Beginning in the 1970s, Congress developed a radically different approach. Under pressure from film studios and other media and entertainment companies, it passed a series of bills that dramatically lengthened the term of copyright, not only for new books but retroactively for books published throughout most of the last century. Today, copyright in a work extends 70 years beyond the date of the author’s death. Congress also removed the requirement that an author register a copyright—and, again, it applied the change retroactively. Now a copyright is established for any work the moment it’s created. Even when writers have no interest in claiming a copyright, they get one—and their works remain out of the public domain for decades. The upshot is that most books or articles written since 1923 remain off limits for unauthorized copying and distribution. Other nations have enacted similar policies, as part of an effort to establish international standards for trade in intellectual property.

Politicians make lousy futurists. As Google and the DPLA can testify, the copyright changes put severe constraints on any attempt to scan, store, and provide online access to books published during most of the last 100 years. Moreover, the removal of the registration requirement means that millions of so-called orphan books—ones whose copyright holders either are unknown or can’t be found—now lie beyond the reach of online libraries. Copyright protections are vitally important to ensuring that writers and artists have the wherewithal to create their works. But it’s hard to look at the current situation without concluding that the restrictions have become so broad as to hamper the very creativity they were supposed to encourage. “Innovation is often being restricted today for legal reasons, not technological ones,” says David K. Levine, an economist at Washington University in St. Louis and coauthor of Against Intellectual Monopoly. In many areas, he says, “people aren’t creating new products because they fear a nightmare of copyright litigation.”

Internet Archive founder Brewster Kahle says the DPLA should support a network of libraries and not build a centralized one.

There’s a further twist. Books and other creative works behind the copyright wall aren’t all that could be off limits. Much of the metadata that libraries employ to catalogue their holdings falls into a gray area with regard to how it can be reused. That’s because many libraries purchase or license metadata from commercial suppliers or from the OCLC, a large library coöperative that syndicates an array of cataloguing information. And because librarians have long used metadata from many sources in classifying their holdings, it can be extraordinarily difficult to sort out what’s under license and what’s not, or who owns what rights. The confusion makes even the DPLA’s seemingly modest effort to collect metadata fraught with complications, according to David Weinberger. He says the DPLA is making progress at solving this problem, but when the library opens its virtual doors, patrons may have to make do with scanty descriptions of its contents.

DREAMS AND REALITIES

Some scholars believe that copyright restrictions will frustrate any attempt to create a universal online library unless Congress changes the law. James Grimmelmann, a copyright expert at New York Law School, feels that it will be “very, very hard” to include orphan works in a digital database without new legislation. Siva Vaidhyanathan, a University of Virginia media studies professor who wants to build an international project to organize research materials online, believes that major changes in copyright law are essential to creating a digital library that includes recent works. He senses that it may take many years of public pressure to get politicians to deliver the necessary remedies.

While Palfrey is hesitant to discuss legal issues, he expresses some hope that progress can be made without congressional action. He feels that the DPLA may be able to hash out an agreement with publishers and authors that would enable it to offer access to at least some of the orphans and other books published since 1923. The DPLA may, according to some copyright experts, have an advantage over Google Book Search in negotiating such an agreement and getting it blessed by the courts: it’s a nonprofit.

The DPLA has made it clear that it will be meticulous in respecting copyrights. If it can’t find a way around current legal constraints, whether through negotiation or through legislation, it will have to limit its scope to books that are already in the public domain. And in that case, it’s hard to see how it would be able to distinguish itself. After all, the Web already offers plenty of sources for public-domain books. Google still provides full-text, searchable copies of millions of volumes published before 1923. So do the HathiTrust, a vast book database run by a consortium of libraries, and Brewster Kahle’s Internet Archive. Amazon’s Kindle Store offers thousands of classic books free. And there’s the venerable Project Gutenberg, which has been transcribing public-domain texts and putting them online since 1971 (when the project’s creator typed the Declaration of Independence into a mainframe at the University of Illinois). Although the DPLA may be able to offer some valuable features of its own, including the ability to search collections of rare documents held by research libraries, those features would probably interest only a small group of scholars.

Despite the challenges it faces, the Digital Public Library of America has an enthusiastic corps of volunteers and some generous contributors. It seems likely that by this time next year, it will have reached its first milestone and begun operating a metadata exchange of some sort. But what happens after that? Will the library be able to extend the scope of its collection beyond the early years of the last century? Will it be able to offer services that spark the interest of the public? If the DPLA is nothing more than plumbing, the project will have failed to live up to its grand name and its even grander promise. The dream of H. G. Wells—and, for that matter, Robert Darnton—will have been deferred once again.

Nicholas Carr writes about technology and culture for several publications, including the Atlantic. His most recent book is The Shallows: What the Internet Is Doing to Our Brains.
