Data Extinction

It’s too late for old word-processing files. But new technologies will preserve access to digital photos, music and other electronic records forever.

Claire Tristram

October 1, 2002

In 1988 Keith Feinstein bought a Star Wars arcade game for his college dorm room. Besides keeping him in beer and pizza money for the next four years, it also launched him on a personal journey that has lasted into the present: he now owns more than 900 vintage video arcade games, which he exhibits in a traveling show known as Videotopia. “People cry,” says Feinstein, who is now 34, and who remembers a childhood complete with the earliest Pong console and an Atari 2600 he loved. “They can walk into an exhibit with hundreds of machines, and in all that incredible cacophony, they run right to their game. These games were a part of our lives. They were our first interactive media.” Some of Feinstein’s lovingly preserved devices are probably the last working models on the planet-the only machines where the 20-year-old software behind these games can come alive on the hardware it was meant for.

Just about the time Feinstein bought his first arcade game, Abby Smith was completing a PhD in medieval Russian history at Harvard University. She was troubled, though, that only a handful of writings from before the 14th century, mostly liturgical documents, had survived the tumult of Russian history. How much had been irretrievably lost? How much of her own time was going to be lost to the future? Something about those questions struck Smith as far more interesting than the work she was doing, so she threw over Russian history to specialize instead in library science. For the past two decades, Smith has helped the U.S. Library of Congress in its task of preserving history. At first she occupied herself with such tasks as saving Lincoln’s original Gettysburg Address from deterioration, but as our culture has grown more digital, Smith has in turn become ever more focused on solving the problem of preserving digital artifacts. She is currently director of programs at the Council on Library and Information Resources, a Washington, DC, nonprofit organization that’s helping the Library of Congress draft a proposal asking legislators to fund research on a long-term solution. “The layman’s view is that digital information is more secure, when in fact it’s far more ephemeral,” she says. “We know how to keep paper intact for hundreds of years. But digital information is all in code. Without access to that code, it’s lost.”

Smith and Feinstein are working opposite ends of the same problem: how to preserve digital things-data, software and the electronics needed to read them-as they age. Paper documents last for hundreds of years, but more and more of what matters to us is digitally produced, and we can’t guarantee that any of it will be usable 100, or 10, or even five years from now. Feinstein’s contribution toward staving off digital obsolescence is to scour flea markets for old circuit boards that might have the chips he needs to repair old games; he is obsessed with keeping every game in his collection working. Smith’s approach is to develop a plan for preserving culture itself; she is obsessed with guaranteeing, for example, that 300 years from now, people will be able to read files that locate nuclear-waste sites. Both are faced with the knowledge that current methods for preserving digital things work poorly, even in the short term.

Just how bad is the problem? Examples of digital things lost forever abound, some personal in scale, some global. Software patents that can be infringed freely because the original software no longer works, preventing the patent holders from proving prior art. Land use and natural-resource inventories for the State of New York compiled in the late 1960s that can’t be accessed because the customized software needed to open the files no longer exists. NASA satellite data from the 1970s that might have helped us understand global warming, were they not unreadable today.

But far worse is yet to come. “Once you begin to understand what’s going on at a more technical level,” says Smith, “you realize that what’s lost could be catastrophic.” We can count on paper documents to last 500 years or longer, barring fire, flood or acts of God. But digital things, be they documents, photographs or video, are all created in a language meant for a specific piece of hardware; and neither computer languages nor machines age well. The amount of material at risk is exploding: the volume of business-related e-mail is expected to rise from 2.6 trillion messages per year in 2001 to 5.9 trillion by 2005, according to IDC, an information technology analysis firm. Maybe most of those messages deserve to be rendered unreadable, but critical documents and correspondence from government and private institutions are in just as much danger of digital obsolescence as spam.

Then there are databases, and software, and images, all of which are in a constant state of change: JPEG, for example, the standard many digital-camera users rely on to store family photos, is already in the process of being outmoded by JPEG 2000, a higher-quality compression standard. “Unless we do something drastic,” says Margaret Hedstrom, professor of information at the University of Michigan’s School of Information, “in one or two or five years it’s going to be very difficult for people to look back and see the photos they took.”

Proposed solutions include migration, which consists of updating or sometimes entirely rewriting old files to run on new hardware; emulation, a way of mimicking older hardware so that old software and files don’t have to be rewritten in order to run on new machines; and more recently, encapsulation, a way of wrapping an electronic document in a digital envelope that explains, in simple terms, how to re-create the software, hardware or operating systems needed to decode what’s inside.

All three solutions, however, have the same sticky problem: the fixes themselves are time-bound, able to work only for several years, or perhaps a few decades, before another fix needs to be made. They also require us to act now to preserve what we think might be important to the future. “We have the problem of how to preserve digital media-hard enough to solve-and we have the additional, impossible responsibility of deciding what to save,” says Smith. “Nothing will be preserved by accident.”

A newly proposed solution, ironically enough, might make use of a very old technology: paper itself. Not to preserve all the digital documents we are creating in hard copy, but rather to preserve the specifications for a decoding mechanism-a kind of “universal computer” defined by a few hundred lines of software code-that will allow the documents to be deciphered in the future. Archived on paper and across the Internet, the mechanism would be guaranteed to survive for centuries. Proponents of such an approach say it will make it possible to preserve everything-a complete record of humanity. Maybe then history can finally stop repeating itself.

What’s So Hard about Digital Preservation?

The naive view of digital preservation is that it’s merely a question of moving things periodically onto new storage media, of making sure you copy your files from eight-inch floppy disks to five-and-a-quarter, to three-and-a-half, to CD, and on to the next thing before the old format fades away completely. But moving bits is easy. The problem is that the decoding programs that translate the bits are usually junk within five years, while the languages and operating systems they use are in a state of constant change.

Every piece of software, and every data file, is at its heart written to instruct a given piece of hardware to perform certain tasks. In other words, it is written in the language of a machine, not of humans. Whenever you create a digital thing, be it a document, a database, a program, an image or a piece of music, it is stored in a form that you can’t read. “It’s like it was written in invisible ink,” says Jeff Rothenberg, a researcher at Rand, a think tank in Santa Monica, CA. “As soon as it’s stored it disappears from human eyes, and you need the right resources to render it visible again, just like invisible ink needs some sort of solvent to be read.” Yet rebuilding old hardware or keeping it around forever to interpret nearly extinct software or formats is economically prohibitive: when shippers dropped one of Feinstein’s vintage arcade games, shattering it, its original manufacturer calculated the insurance costs to restore the cabinet alone at $150,000, while making new chips for the game-from dies that no longer exist-would have cost millions.

Software companies confront the problem of digital preservation every day as they update their code, making sure it works with the latest hardware and operating systems, while at the same time ensuring that customers can access old files for a reasonable amount of time. But without some sort of digital resuscitation, every application-from the original binary codes written in the 1940s to WordPerfect to the latest million-dollar database application-eventually stops working, and every data file eventually becomes unreadable. Every application and every file.

The evolution of operating systems (the programs that allow other programs to run) provides yet another challenge. As Microsoft improves Windows, for example, it introduces new guidelines for programmers, known as application programming interfaces, every few months, adding some features and taking others away. In each new release, some interfaces are “deprecated,” meaning that programmers are advised to stop using them in the software they write. But what does that mean for programs written before the change? Most programs that use deprecated features will work for a while, but they access the underlying architecture in a less direct way than the newer interfaces do, and the program is likely to run more slowly. How long before it stops? Most people actively trying to keep old files and applications operational say that five years is pushing it. “Interfaces change continually,” says one Windows developer. “It’s like asking how often the beach changes shape. Sometimes big storms come and nothing looks the same.”

But when programs are painstakingly rewritten to conform to new operating-system guidelines, they eventually become unable to access files created by their own precursors. “I frankly don’t expect to have a version of Quicken in 10 years that will be able to read my tax files from today,” says Gordon Bell, who led the development of some of the first minicomputers as vice president of research and development at Digital Equipment, and who now works as a senior researcher at Microsoft’s Bay Area Research Center. “Especially anything that is database oriented, with a lot of complexity in the data structure, is difficult to move from one generation to the next.”

Migration: Digital Transplant Operations

One of the most common methods for preserving digital information is migration, where the bits in a file or program are altered to make them readable by new hardware and operating systems. It’s what happens when you open an old document, such as a Microsoft Word 95 file, with a new iteration of the same software, say Microsoft Office 2001. The drawbacks? Each file needs to be opened, converted and saved individually, a process that grows impossibly large when you consider a librarian’s or archivist’s initiative to save as much of the historical record as possible. And eventually even the most meticulous of software companies stops supporting old versions of its products. If a file has not been migrated before that time, it’s digital gibberish.

Worse, each time a file is migrated, some information is irreversibly lost. “Imagine someone saying, ‘Okay, the way we’re going to preserve Rembrandt is that five years from now we’re going to have another artist come in and copy his paintings, and then we’ll throw away the original,’” says Rand’s Rothenberg. “And so on after another five years. The notion is laughable with art, because you know that every time you copy, you corrupt. It’s the same with computers.”

Migrating text files is hard enough; migrating application software is even more so. Indeed, the term “migrating” is a misnomer, since it often means throwing out the old program and writing an entirely new one in a new programming language, a process that programmers prefer to call “porting.” The new program may look the same on the monitor, but underneath it is new. No matter how carefully software engineers have worked to simulate the old program, every line of code is different, with new bugs and new idiosyncrasies.

In any case, it’s rarely the goal of the new program to simulate the old one exactly; it’s far more common for programmers to want to improve upon the past. That’s a goal that keeps computer science advancing at an exponential rate, and it probably also explains why the technical problem of preserving the past has received so little attention from those who helped create the problem in the first place.

“Computer scientists are in a profession where there is virtually no need for historical information,” says Abby Smith. “They don’t need information from the 1650s or the 1940s. They are used to things superseding what came before. For those in the humanities, there is no such notion. They work by accumulation, not replacement.”

Emulation: Digital CPR

An even purer example of the problems associated with preserving digital objects is seen in the widespread attempt to keep arcade games like Joust and Asteroids playable today. Feinstein is keeping old games alive by preserving the machines that run them, but many others are trying a different means: hacks are importing the games onto today’s PCs.

Such hacks use a technique called emulation, creating a program that simulates the registers (storage locations in the central processing unit) and behaviors of the old machine, and which can fool old games into thinking they are being run on old hardware. Emulation has the advantage of keeping the original bits of a given file or program intact, warts and all. “In porting, it’s difficult to capture the bugs and idiosyncrasies of the original,” says Jeff Vavasour, chief technical officer of Emeryville, CA-based Digital Eclipse, which is currently writing software to revive the original Joust and other arcade classics. “In games, that’s important. So we don’t port. We use emulation instead.”
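To make the idea concrete, here is a minimal sketch of what an emulator does, written in Python. The machine it imitates is entirely made up for illustration: the opcodes, the four registers and the sample program are assumptions, not the instruction set of any real arcade board or of Digital Eclipse's software. The point is only that the old program's bytes are never rewritten; a new program reads them one instruction at a time and mimics what the old hardware would have done, quirks included.

```python
# Toy emulator for a hypothetical 8-bit machine (invented for illustration).
# Opcodes, all assumptions:
#   0x01 r v : LOAD  value v into register r
#   0x02 a b : ADD   register b into register a, wrapping at 8 bits
#   0x03 r   : PRINT register r (appended to an output list here)
#   0xFF     : HALT

def run(program):
    """Interpret the old program's bytes without modifying them."""
    registers = [0, 0, 0, 0]   # emulated CPU registers
    pc = 0                     # emulated program counter
    output = []
    while True:
        opcode = program[pc]
        if opcode == 0x01:
            registers[program[pc + 1]] = program[pc + 2]
            pc += 3
        elif opcode == 0x02:
            a, b = program[pc + 1], program[pc + 2]
            # The imagined hardware had 8-bit registers, so sums wrap around.
            # Reproducing that quirk is part of faithful emulation.
            registers[a] = (registers[a] + registers[b]) & 0xFF
            pc += 3
        elif opcode == 0x03:
            output.append(registers[program[pc + 1]])
            pc += 2
        elif opcode == 0xFF:
            return output
        else:
            raise ValueError(f"unknown opcode {opcode:#x} at address {pc}")

# The "old software": load 200 and 100, add them, print the result, halt.
OLD_PROGRAM = bytes([0x01, 0, 200, 0x01, 1, 100, 0x02, 0, 1, 0x03, 0, 0xFF])
result = run(OLD_PROGRAM)
```

A port of the same program to a modern language would most naturally print 300; the emulated version prints the wrapped value the old machine would have produced, which is the kind of "wart" Vavasour says matters in games.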

Indeed, emulation has been proposed as a way to keep not just games but everything else digital alive. It has its own drawbacks, however. “Emulation doesn’t preserve, it just mimics,” says Feinstein. “The timing will be all wrong. Or the sound will be off…. It’s like the guy who reshot the film Psycho using Hitchcock’s shot book. You recognize something of the original, but mostly you recognize how different it is from the original.”

Looking for hard evidence to support claims like Vavasour’s, that emulation is better at preserving digital content’s original look and feel, Hedstrom and her colleague Cliff Lampeso at the University of Michigan recently organized one of the first studies to compare migrated and emulated versions of the same software. Subjects first spent an hour learning the maze game Chuckie Egg on its original platform, the BBC Micro, a microcomputer popular in Britain in the mid-1980s. They then played the game twice more on modern PCs, once with a version that had been migrated into a modern computer language and again with the original BBC Micro code running inside an emulator. Hedstrom and Lampeso found no statistically significant difference in the way the subjects rated the performance of the two versions. Says Hedstrom, “It was not apparent that emulation did a better job.”

Nonetheless, some computer scientists have suggested “chains” of emulators as a temporary solution to the problem of digital obsolescence: as each generation of hardware grows obsolete, it will be replaced by a layer of emulation software. But it’s an idea that has others shaking their heads. “It’s extremely dangerous to talk about emulation as a solution,” says David Bearman, president of Archives and Museum Informatics, a consulting group that works with business and government entities, helping them preserve digital files. “It gives an excuse to managers and governments around the world to put off doing things that really need to be done right now.”

Encapsulation: Digital Cryonics

Neither migration nor emulation, then, offers a satisfactory long-term way to wrest digital bits from what Shakespeare called “the wrackfull siege of batt’ring days.” The only real way to keep digital things alive for the duration, many believe, is to lift them out of this inexorable march of digital progress-but to leave signposts that will tell future generations how to reconstruct what has passed.

Consortia of libraries and archivists worldwide are working on a solution called encapsulation: a way to group digital objects together with descriptive “wrappers” containing instructions for decoding their bits in the future. A wrapper would include both a physical outer layer, similar to the jacket of a floppy disk, imprinted with human-readable text describing the encapsulated content and how to use it, and a digital inner layer containing the specifications for the software, operating system and hardware needed to read the object itself. A Microsoft Word document, for example, might be packaged with instructions for re-creating Word, Windows and perhaps even an emulated version of a Wintel PC. For text documents, at least, encapsulation seems likely to be a viable method for long-term preservation, especially once international standards bodies agree on a uniform system for building wrappers. But if the documents being preserved contain more than simple text, encapsulation seems less likely to succeed: there are simply too many new software releases, compression schemes and hardware formats each year to describe all of them through encapsulation.
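The digital inner layer of such a wrapper might look something like the following Python sketch. Everything here is an assumption made for illustration: the field names, the use of JSON and base64, and the helper names `encapsulate` and `unwrap` are hypothetical and do not come from any archival standard. The sketch shows only the basic shape of the idea, which is that the original bits travel unchanged alongside a plain-language description and notes on how to decode them.

```python
import base64
import json

def encapsulate(payload, description, decoding_notes):
    """Bundle raw bytes with human-readable instructions for reading them.

    All field names are hypothetical, chosen for this example only.
    """
    wrapper = {
        "description": description,          # what the object is, in plain words
        "decoding": decoding_notes,          # software/format needed to read it
        "payload_base64": base64.b64encode(payload).decode("ascii"),
    }
    return json.dumps(wrapper, indent=2)

def unwrap(package):
    """Recover the description, the decoding notes and the original bytes."""
    wrapper = json.loads(package)
    payload = base64.b64decode(wrapper["payload_base64"])
    return wrapper["description"], wrapper["decoding"], payload

# Example: wrap a short text file together with notes on its format.
original = "Minutes of the council meeting".encode("ascii")
package = encapsulate(
    original,
    "Meeting minutes, plain text",
    {"format": "7-bit ASCII text", "line_endings": "none"},
)
description, notes, recovered = unwrap(package)
```

Note that the wrapper is itself written in a format that a future reader must understand, which is one reason the proposal pairs the digital layer with a physical, human-readable one.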

“The pagination is off even when you open a last-generation Word document,” observes Steve Gilheany, senior systems engineer at Archive Builders, a Manhattan Beach, CA-based records-management consulting group that has assisted the city of Los Angeles in its digital-document preservation. “Imagine then what happens when you try to open it in a hundred years or try to access a digital object more complicated than pages of text.”

Gilheany’s proposed solution is simpler, borrowing the concept behind that archetypal decryption key, the Rosetta stone. He recommends archiving critical files in at least three formats: The first would be a standard raster or bit-map format, where there is a one-to-one correspondence between how coordinates are stored and how they are displayed, without the kind of compression used today for large files like JPEG images. The second would be the file’s native format, whatever it happens to be, to simplify any future modifications. The third would be a “vector-based” format storing each letter, symbol or image in the form of a mathematical description of its shape on the page; Adobe Systems’ Portable Document Format is one example. In theory, each version could be used to decode the others. Gilheany has spent eight years assisting the Los Angeles city government in converting its original infrastructure documents into raster and PDF files, and in the absence of a better solution, most government agencies and others with critical archival needs are taking a similar approach.

Encapsulation and conversion, though, require foresight; as Smith notes, anything that isn’t expressly encapsulated or converted will surely disappear. These solutions also aren’t particularly long lived, at least compared with things like stone hieroglyphs or even paper. “Some researchers predict very long lifetimes for some types of media,” says Raymond Lorie, a research fellow at IBM’s Almaden Research Center in San Jose, CA. “But if a medium is good for N years, what do we do for N-plus-one years? Whatever N is, the problem does not go away.”

The Universal Virtual Computer

Proponents of emulation and encapsulation are thinking the wrong way, Lorie believes. Packaging complex data with the software needed to read it is too complicated, he thinks, and saving data in simple formats and trusting that someone a century hence will still be able to decode them is too risky. Instead, he’s building a universal decoding machine-a primitive program that would begin working behind the scenes to preserve a digital thing as soon as it was created-and proposing that it be promulgated so widely that it would become an inextricable part of our culture, like copies of the Bible or the U.S. Constitution. This program would be written in a simple machine language; it could be used to unlock files and to run application software even after the formats in which the files are stored grow obsolete; and most important, it wouldn’t require any particular foresight about which things should be saved.

Lorie believes that this program, which he calls the universal virtual computer, should be constructed independently of any existing hardware or software, so that it is independent, too, of time. It would simulate the same basic architecture that every computer has had since the beginning: memory, a sequence of registers, and rules for how to move information among them. Computer users could create and save digital files using the application software of their choice; when a digital file was saved, though, it would also be backed up in a file that could be read by the universal computer. When someone wanted to read the file in the future, only a single emulation layer-between the universal virtual computer and the computer of that time-would be needed to access it.
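A rough sense of how this could work is given by the Python sketch below. It is not Lorie's design: the instruction names, the four registers, the compressed format and the function `run_uvc` are all assumptions invented for this example. The sketch defines a very small virtual machine, then stores a document in a simple compressed form (pairs of a repeat count and a byte value) together with a decoder program written for that machine. A future reader would need only to implement the small interpreter, the single emulation layer, in order to run the archived decoder and recover the data.

```python
def run_uvc(program, data):
    """Interpret a decoder program on a tiny hypothetical virtual machine.

    Instructions (all invented for this sketch):
      ("IN", r)        read the next input byte into register r, or -1 at end
      ("OUT", r)       append the value in register r to the output
      ("DEC", r)       subtract 1 from register r
      ("JZ", r, t)     jump to instruction t if register r is zero
      ("JNEG", r, t)   jump to instruction t if register r is negative
      ("JMP", t)       jump to instruction t
      ("HALT",)        stop and return the output
    """
    regs = [0, 0, 0, 0]
    pc = 0            # index of the next instruction
    pos = 0           # read position in the input data
    out = bytearray()
    while True:
        op, *args = program[pc]
        pc += 1
        if op == "IN":
            if pos < len(data):
                regs[args[0]] = data[pos]
                pos += 1
            else:
                regs[args[0]] = -1
        elif op == "OUT":
            out.append(regs[args[0]])
        elif op == "DEC":
            regs[args[0]] -= 1
        elif op == "JZ":
            if regs[args[0]] == 0:
                pc = args[1]
        elif op == "JNEG":
            if regs[args[0]] < 0:
                pc = args[1]
        elif op == "JMP":
            pc = args[0]
        elif op == "HALT":
            return bytes(out)
        else:
            raise ValueError(f"unknown instruction {op!r}")

# Decoder archived alongside the data: expands (count, value) pairs.
DECODER = [
    ("IN", 0),        # 0: read repeat count
    ("JNEG", 0, 7),   # 1: no more input, so halt
    ("IN", 1),        # 2: read the byte value
    ("JZ", 0, 0),     # 3: count used up, so fetch the next pair
    ("OUT", 1),       # 4: emit the value
    ("DEC", 0),       # 5: one fewer repetition left
    ("JMP", 3),       # 6: loop
    ("HALT",),        # 7: done
]

archived_data = bytes([3, ord("H"), 1, ord("i")])
decoded = run_uvc(DECODER, archived_data)
```

The design choice worth noticing is that knowledge of the file format lives in `DECODER`, which is archived with the data, while the only thing the future must rebuild is the short interpreter.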

“Ray’s suggested universal virtual computer is a good idea,” comments Rand’s Rothenberg. In fact, he says it’s one possible version of a concept he has been developing himself, something called the “emulation virtual machine.” Rothenberg’s machine would be a universal platform for emulating obsolete computers, which could then run obsolete software to render obsolete digital objects. Lorie’s solution, Rothenberg says, is similar in spirit but “far less general.”

Lorie, however, believes in keeping things simple-so simple, in fact, that he wants to fit the specifications for his universal computer into only 10 to 20 pages of text, which could be distributed via the Web and copied out on paper everywhere, assuring their survival. “Saving one single paper document allows us to save millions of documents around the world,” he says.

Will it work? Last September Lorie demonstrated his approach at the National Library of the Netherlands, successfully translating a PDF version of a scientific paper on drug research into his universal format. The reconstruction not only kept the look of the original’s fonts and formatting, it also created “metadata” to clue in future users about its content.

In addition to text files, Lorie’s approach could also be used to save today’s digital photographs, sound and video files, and software applications for future generations; the content or software need only be described and saved in a way that is compatible with the universal computer. But he believes that the ability to decode today’s data files will be far more valuable than the ability to run old software. Imagine, for example, being able to view data not just with today’s visualization tools but in ways that won’t be invented for another hundred years. “It’s not just that you want to save the document,” he explains. “You want to make the data within the document available to whatever new programs we may have in the future.”

Unawakened Demand

Boiling down the specifications for the universal virtual computer into a handful of pages poses technical problems that Lorie believes can be solved. But will they be? Like everything else involving information technology, they won’t be until there is enough demand to pay for the development work. By that time, however, many digital things may be past the point of resuscitation. Lorie is the only researcher at IBM with funding to study the universal virtual computer. “I wish I could say I have 20 people working on the problem, but I don’t,” he says.

Robert Morris, director of the Almaden lab and Lorie’s boss, doesn’t equivocate. “It’s unfortunate, but the reason there’s not a huge amount of activity is because there isn’t a lot of money in it,” he says. “At the moment there are not a lot of people clamoring to solve this problem.”

That may change as computer users realize how much has already evaporated. In October 2001 Brewster Kahle, the man behind a project known as the Internet Archive, put up a Web site known as the Wayback Machine, a way for people to search the archive’s collection of 10 billion Web pages it had crawled over the previous five years. With 1997-era Web pages in his archive, Kahle is already grappling with preservation questions. Many of the pages suffer from broken links and half-missing text, and whole classes of items-those protected by passwords or payments, for example-aren’t archived at all. “We don’t know how much we’ve lost,” he says.

Like global warming, the problem of digital preservation is so big that it’s hard to grasp. But when a million people are using the Wayback Machine and not finding the digital files they’re searching for? Then the problem starts to become real.

“People count on libraries to archive human creativity,” Abby Smith says. “It’s important for people to know, though, that libraries are at a loss about how to solve this problem.” When computer users are saving documents or images, they don’t think twice about making them accessible to future generations, she says. “They need to.”

Digital-Preservation Proposals

Technique	Description	Pros	Cons
Migration	Periodically convert digital data to next-generation formats	Data are “fresh” and instantly accessible	Copies degrade from generation to generation
Emulation	Write software mimicking older hardware or software, tricking old programs into thinking they are running on their original platforms	Data don’t need to be altered	Mimicking is seldom perfect; chains of emulators eventually break down
Encapsulation	Encase digital data in physical and software “wrappers,” showing future users how to reconstruct them	Details of interpreting data are never separated from the data themselves	Must build new wrappers for every new format and software release; works poorly for nontextual data
Universal virtual computer	Archive paper copies of specifications for a simple, software-defined decoding machine; save all data in a format readable by the machine	Paper lasts for centuries; machine is not tied to specific hardware or software	Difficult to distill specifications into a brief paper document
