NARA’s crash data-preservation project is coming none too soon; today’s history is born digital and dies young. Many observers have noted this, but perhaps none more eloquently than a U.S. Air Force historian named Eduard Mark. In a 2003 posting to a Michigan State University discussion group frequented by fellow historians, he wrote: “It will be impossible to write the history of recent diplomatic and military history as we have written about World War II and the early Cold War. Too many records are gone. Think of Villon’s haunting refrain, ‘Ou sont les neiges d’antan?’ and weep…. History as we have known it is dying, and with it the public accountability of government and rational public administration.” Take the 1989 U.S. invasion of Panama, in which U.S. forces removed Manuel Noriega and 23 troops lost their lives, along with at least 200 Panamanian fighters and 300 civilians. Mark wrote (and recently stood by his comments) that he could not secure many basic records of the invasion, because a number were electronic and had not been kept. “The federal system for maintaining records has in many agencies – indeed in every agency with which I am familiar – collapsed utterly,” Mark wrote.
Of course, managing growing data collections is already a crisis for many institutions, from hospitals to banks to universities. Tom Hawk, general manager for enterprise storage at IBM, says that in the next three years, humanity will generate more data – from websites to digital photos and video – than it generated in the previous 1,000 years. “It’s a whole new set of challenges to IT organizations that have not been dealing with that level of data and complexity,” Hawk says. In 1996, companies spent 11 percent of their IT budgets on storage, but that figure will likely double to 22 percent in 2007, according to International Technology Group of Los Altos, CA.
Still, NARA’s problem stands out because of the sheer volume of the records the U.S. government produces and receives, and the diversity of digital technologies they represent. “We operate on the premise that somewhere in the government they are using every software program that has ever been sold, and some that were never sold because they were developed for the government,” says Ken Thibodeau, director of the Archives’ electronic-records program. The scope of the problem, he adds, is “unlimited, and it’s open ended, because the formats keep changing.”
The Archives faces more than a Babel of formats; the electronic records it will eventually inherit are piling up at an ever-accelerating pace. A taste: the Pentagon generates tens of millions of images from personnel files each year; the Clinton White House generated 38 million e-mail messages (and the current Bush White House is expected to generate triple that number); and the 2000 census returns were converted into more than 600 million TIFF-format image files, some 40 terabytes of data. A single patent application can contain a million pages, plus complex files like 3-D models of proteins or CAD drawings of aircraft parts. All told, NARA expects to receive 347 petabytes (see “Definitions” on previous page) of electronic records by 2022.
Currently, the Archives holds only a trivial number of electronic records. Stored on steel racks in NARA’s 11-year-old facility in College Park, the digital collection adds up to just five terabytes. Most of it consists of magnetic tapes of varying ages, many of them holding a mere 200 megabytes apiece – about the size of 10 high-resolution digital photographs. (The electronic holdings include such historical gems as records of military psychological-operations squads in Vietnam from 1970 to 1973, and interviews, diaries, and testimony collected by the U.S. Department of Justice’s Watergate Special Prosecution Force from 1973 to 1977.) From this modest collection, only a tiny number of visitors ever seek to copy data; little is available over the Internet.
Because the Archives has no good system for taking in more data, a tremendous backlog has built up. Census records, service records, Pentagon records of Iraq War decision-making, diplomatic messages – all sit in limbo at federal departments or in temporary record-holding centers around the country. A new avalanche of records from the Bush administration – the most electronic presidency yet – will descend in three and a half years, when the president leaves office. Leaving records sitting around at federal agencies for years, or decades, worked fine when everything was on paper, but data bits are nowhere near as reliable – and storing them means paying not just for the storage media, but for a sophisticated management system and extensive IT staff.