Digitize This

Yahoo hopes to trump Google with its Open Content Alliance publishing venture.

Wade Roush

October 20, 2005

Google shook up the worlds of publishing and library science last year when it announced it would digitize millions of books from several of the world’s greatest libraries – including Oxford’s Bodleian Library and the New York Public Library – and make their contents searchable on the Web (see “The Infinite Library”).

Many librarians applauded Google’s move, and predicted it would jumpstart a broader effort to ensure universal electronic access to human knowledge. But publishers weren’t as pleased – particularly because Google said it would not seek permission to scan and index books still covered by copyright.

Now a group led by one of Google’s main rivals, Yahoo, is trying a more collective approach to digitization. On October 4, Yahoo and ten partner organizations announced the formation of the Open Content Alliance, which plans to build a free, permanent online repository for a wide range of print and multimedia content, including both copyrighted works and those that have passed into the public domain.

Yahoo’s partners in the alliance are Adobe Systems, the European Archive, Hewlett-Packard Labs, the Internet Archive, the National Archives of the United Kingdom, O’Reilly Media, the Prelinger Archives, the University of California, and the University of Toronto.

In contrast to Google’s approach, which requires publishers to “opt out” if they don’t want their works to be included, the alliance will only disseminate copyrighted works after their publishers have explicitly opted into the program, according to David Mandelbrot, Yahoo’s vice president for search technology.

Mandelbrot says the alliance will encourage other entities, including Google, to contribute to the repository, and will create a set of standards for digitization intended to make it easier to pool the products of various digitization efforts and to make them searchable from any search engine. Technology Review’s executive Web editor, Wade Roush, recently interviewed Mandelbrot about Yahoo’s approach to digitizing the world’s literature.

Wade Roush: How did the Open Content Alliance come about?

David Mandelbrot: In March of last year we launched our effort to partner with content rights holders. We wanted to move beyond what we could provide just by crawling the Web and improve the quality of Yahoo search. Soon after, we connected with the folks at the Internet Archive, who are doing great work with digitizing works. They were hosting a lot of great content and we wanted to integrate that into our search engine.

As we started that discussion, Brewster [Kahle, the founder of the Internet Archive] became focused on what we can do together to digitize content. They’ve developed a great scanning technology and a really good way to digitize works of literature, but they were looking for partners to help them get their message out there and get funding flowing. From those discussions, we decided to form this Open Content Alliance.

WR: What are the Alliance’s goals, and how will this program differ from other efforts – notably Google’s – to digitize large amounts of non-digital content?

DM: Over the time we were discussing forming the alliance, Google did launch their program, and we looked at their program for ideas about what they were doing and things we might want to do differently. We do want to have copyrighted works available through the Open Content Alliance – but only with the express permission of the copyright holder.

Secondly, we mainly want the alliance to focus on this theme of openness. One of the things we’ve seen with other [digitization] programs is they tend to use proprietary technologies to host the content, so it’s impossible for third-party search engines to crawl it. So we’re using XML and PDF and making the content easily crawlable by search engines. It was important to make this project open so that entities that contribute know they’re not just benefiting one search engine.

WR: So you feel you learned directly from the reaction Google’s project provoked from the publishing community.

DM: When the topic of making copyrighted works available came up, we always assumed we would need to get permission from copyright holders. We were surprised to see that in other programs, the copyrighted work would be made available without permission.

WR: Okay. But in Google’s defense, once you decide you’re going to seek express permission to digitize every work, that drastically limits the amount of content that will be available online, doesn’t it? Basically, we’re talking about everything published after 1923.

DM: There are great gains to be made just by digitizing public-domain works. Edgar Allan Poe and Henry James can be made available in their entirety online. In addition, we’ve been very excited about the response we’ve received from both the major publishing associations and the publishers themselves. Many are showing interest in working with us on this program. While there will be agreements that will need to be ironed out, we’re confident that we will be able to get a lot of copyrighted work online.

WR: That was going to be my next question: Do you think the Open Content Alliance’s approach will be more palatable to publishers?

DM: When it comes to these digitization efforts, the publishers have primarily been speaking through the publishers’ associations rather than individually, because they’re concerned about any kind of retribution that could come from search engines if they’re critical of any particular effort. But what we have heard from the publishers’ associations is that they’re very happy about the approach we’re taking. The Association of Learned and Professional Society Publishers, for instance, has been very positive about our program, because of the fact we are working with the copyright holders in advance.

WR: All of these digitization efforts – yours, Google’s, others at universities and at the Library of Congress – are moving us toward Brewster Kahle’s dream of a universal online library. But they’re all going at the problem in different ways. Do you worry that the world’s literature will, in effect, become fragmented – that Web users will have to choose between the Open Content Alliance’s “universal” library and Google’s “universal” library?

DM: We’re encouraging participation in the alliance by all entities that are engaged in digitization efforts. The Open Content Alliance has already had a very preliminary discussion with Google about its participation, and we encourage Google to contribute work that they digitize to this alliance. We don’t see the alliance as offering a competing digitization effort, but rather as establishing a set of guidelines for the sharing of content.

WR: That’s very civic-minded. But what’s in it for Yahoo?

DM: At Yahoo our goal is to help people find, use, share, and expand all human knowledge. The alliance is really part of our effort to expand knowledge. If you look at our media business, we’re doing the same thing – particularly with some of our more recent efforts to offer content directly to people, such as the Yahoo Music venture. To the extent there is an expansion of human knowledge online, there is a great advantage to be had by Yahoo and other search companies, since there’s a need for ways to find that content.

Also, as part of our participation in the alliance, Yahoo will be providing search technology to allow people to search for content on the Open Content Alliance website. All of that content will also be searchable on Yahoo.

WR: What excites you personally about this project? Are you a closet bookworm?

DM: I am. When I was a kid, every day on my way home from school, I had to change buses at the local library, and every day I had a 45-minute block of time to spend in the library, just exploring what was there. Often 45 minutes would turn into an hour and 45 minutes, or two hours and 45 minutes.

What’s so exciting about this new mission is the ability to move a lot of great content in libraries into a digital format and take advantage of all the benefits of being digital: [it] can be open 24 hours and can be open worldwide. It can house so much more content than can be housed in a physical library – and not just books, but more and more libraries are moving beyond books to other media, like audio and video. It’s fun for us to participate in an alliance like this that allows us to create such an awesome library online.
