Net Worth

Efforts to preserve the Web should make use of the powerful, distributed collaboration it allows.

Kris Carpenter Negulescu

December 20, 2011

The challenge of collecting and preserving the Web, or even a representative sample of it, is a daunting one (see “Fire in the Library”). It is not enough to simply capture the information a website contained, be that text, images, or video. We must preserve something of the experience and activity a site supported. How a site was accessed, who linked to it, and how that changed over time provide important context for critical events such as the recent tsunami in Japan or the events of 9/11, which are relatively distant when measured against the speed at which the Web evolves and leaves data behind. No lone institution can attempt to preserve all that. It will take the commitment of a critical mass of government institutions, companies, nonprofits, and more to ensure the longevity of our digital heritage, nationally and globally.

Current notions of what the Web represents socially, culturally, politically, economically, legally, and even scientifically vary depending on where you happen to live in the world. The value systems to which you subscribe shape what you see in the Web. This is an advantage when thinking of how to preserve the diversity of experience online. Unfortunately, many factors work against the cross-cultural collaboration needed to preserve the Web’s diversity at scale. Local legislation can hinder attempts to share information; companies can fear negative commercial consequences from providing access to their data; and limited budgets constrain the few organizations, such as the Internet Archive, that are dedicated to preserving the Web.

In a perfect world, this would not be the case. Individuals, governments, universities, libraries, and corporations would all work to preserve the world’s most vibrant cultural medium. Imagine for a moment an approach to preservation that builds on the fundamental strengths of the Internet itself—distributed, ubiquitous, relatively inexpensive, not easily quelled or manipulated by any single actor. “Netizens” from around the globe would work to build a unified Web archive spanning cultural, political, and commercial boundaries. Subject-matter experts would ensure that their spheres were adequately represented; others would confirm that a representative sample across all domains was being collected.

The result would not be a single resource but, rather, a distributed collection of them. We would need the equivalent of search engines for this Web of the past, and new tools to mine, graph, and study it.

Making this happen would require a global willingness to exchange data for long-term preservation. Is this too far-out to imagine? Perhaps. But such coöperation is appearing within international research communities and cultural groups in both Europe and the United States. This work creates a foundation we can build upon. Only by encouraging this type of collaboration among like-minded communities can we hope to preserve any significant slice of the Web. The future does not afford anyone the luxury of the unlimited time, funds, computing power, and storage capacity that would be needed to do it alone.

Kris Carpenter Negulescu is director of Web archiving at the Internet Archive, a nonprofit Internet library that preserves digital content.
