Data under the Desk
The good news is that at least some of the rocket science behind the Archives’ “moon shot” is already being developed by industry, other U.S. government agencies, and foreign governments. For example, Hewlett-Packard, IBM, EMC, PolyServe, and other companies have developed “virtual storage” technologies that automatically spread terabytes of related data across many storage devices, often of different types. Virtualization frees up IT staff, balances loads when demand for the data spikes, and allows hardware upgrades to be carried out without downtime. Although the Archives will need technologies far beyond virtual storage, the commercial efforts form a practical foundation. The Archives may also benefit from the examples of digital archives set up in other nations, such as Australia, where archivists are using open-source software called XENA (for XML Electronic Normalizing of Archives) to convert records into a standardized format that will, theoretically, be readable by future technologies. NARA will also follow the lead of the U.S. Library of Congress, which in recent years has begun digitizing collections ranging from early American sheet music to immigration photographs and putting them online, as part of a $100 million digital preservation program.
But to extend the technology beyond such commercial and government efforts, NARA and the National Science Foundation are funding research at places like the San Diego Supercomputer Center. There, researchers are, among other things, learning how to extract data from old formats rapidly and make them useful in modern ones. For example, San Diego researchers took a collection of data on airdrops during the Vietnam War – everything from the defoliant Agent Orange to pamphlets – and reformatted it so it could be displayed using nonproprietary versions of digital-mapping programs known as geographic information systems, or GIS (see “Do Maps Have Morals?” Technology Review, June 2005). Similarly, they took lists of Vietnam War casualties and put them in a database that can show how they changed over the years, as names were added or removed. These are the kinds of problems NARA will face as it “ingests” digital collections, researchers say. “NARA’s problem is they will be receiving massive amounts of digital information in the future, and they need technologies that will help them import that data into their ERA – hundreds of millions of items, hundreds of terabytes of data,” says Reagan Moore, director of data-knowledge computing at the San Diego center.
Another hive of research activity on massive data repositories: MIT. Just as the government is losing its grip on administrative, military, and diplomatic history, institutions like MIT are losing their hold on research data – including the early studies and communications that led to the creation of the Internet itself. “MIT is a microcosm of the problems [NARA] has every day,” says MacKenzie Smith, the associate director for technology at MIT Libraries. “The faculty members are keeping their research under their desks, on lots and lots of disks, and praying that nothing happens to it. We have a long way to go.”
Now MIT is giving faculty another place to put that data. Researchers can go online and upload information – whether text, audio, video, images, or experimental data sets – into DSpace, a storage system created in collaboration with Hewlett-Packard and launched in 2002 (see “MIT’s DSpace Explained”). DSpace makes two identical copies of all data, catalogues relevant information about the data (what archivists call “metadata,” such as the author and creation date), and gives each file a URL, or Web address. This address won’t change even if, say, the archivist later wants to put a given file into a newer format – exporting the contents of an old Word document into a PDF file, for instance. Indeed, an optional feature in DSpace will tell researchers which files are ready for such “migration.”
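For readers who think in code, the three ideas in that description (duplicate copies, a metadata record, and an address that survives a format change) can be sketched in a few lines of Python. This is a toy illustration, not DSpace's actual design or API: the `ToyArchive` and `Record` names, the metadata fields, and the use of a random identifier in place of a real Web address are all assumptions made for the example.

```python
import shutil
import uuid
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Record:
    """Metadata kept alongside a deposited file (hypothetical fields)."""
    author: str
    created: str
    copies: list = field(default_factory=list)  # paths of the stored copies


class ToyArchive:
    """Illustrative only: not DSpace's actual design or API."""

    def __init__(self, primary: Path, mirror: Path):
        self.stores = [primary, mirror]
        for store in self.stores:
            store.mkdir(parents=True, exist_ok=True)
        self.records = {}  # stable identifier -> Record

    def deposit(self, source: Path, author: str, created: str) -> str:
        """Copy the file into both stores and return its stable identifier."""
        identifier = uuid.uuid4().hex  # stands in for a permanent Web address
        record = Record(author=author, created=created)
        for store in self.stores:
            target = store / f"{identifier}{source.suffix}"
            shutil.copy2(source, target)
            record.copies.append(target)
        self.records[identifier] = record
        return identifier

    def migrate(self, identifier: str, converted: Path) -> None:
        """Swap in a file in a newer format; the identifier stays the same."""
        record = self.records[identifier]
        new_copies = []
        for store, old in zip(self.stores, record.copies):
            target = store / f"{identifier}{converted.suffix}"
            shutil.copy2(converted, target)
            if old != target:
                old.unlink()  # retire the copy in the old format
            new_copies.append(target)
        record.copies = new_copies
```

The point of the sketch is the separation between the identifier and the stored files: a reader who bookmarked the identifier before a migration would still reach the document afterward, even though the bytes on disk are now in a different format.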
Because the software behind DSpace is open source, it is available for other institutions to adapt to their own digital-archiving needs; scores have already done so. Researchers at MIT and elsewhere are working on improvements such as an auditing feature that would verify that a file hasn’t been corrupted or tampered with, and a system that checks accuracy when a file migrates into a new format. Ann Wolpert, the director of MIT Libraries (and chair of Technology Review’s board of directors), says DSpace is just a small step toward tackling MIT’s problems, never mind NARA’s. “These changes have come to MIT and other institutions so rapidly that we didn’t have the technology to deal with it,” Wolpert says. “The technology solutions are still emerging.” Robert Tansley, a Hewlett-Packard research scientist who worked on DSpace, says the system is a good start but cautions that “it is still quite new. It hasn’t been tested or deployed at a massive scale, so there would need to be some work before it could support what the National Archives is looking at.”
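The auditing feature mentioned above, one that verifies a file hasn't been corrupted or tampered with, is commonly built on checksums: record a cryptographic digest when a file is deposited, then recompute it later and compare. The Python sketch below shows that general technique under stated assumptions; the source does not say how the DSpace researchers planned to implement their version, and the `checksum` and `audit` function names are hypothetical.

```python
import hashlib
from pathlib import Path


def checksum(path: Path) -> str:
    """Return the SHA-256 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit(manifest: dict) -> list:
    """Return the paths that are missing or whose digest no longer matches.

    `manifest` maps each file path to the digest recorded at deposit time.
    """
    return [
        path
        for path, recorded in manifest.items()
        if not path.exists() or checksum(path) != recorded
    ]
```

A check like this catches accidental corruption readily. Catching deliberate tampering also requires that the recorded digests themselves be stored somewhere an intruder cannot alter them.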