MIT’s Superarchive

A digital repository will revolutionize the way research is shared and preserved.

Sally Atwood

December 1, 2002

Every year MIT researchers create at least 10,000 papers, data files, images, collections of field notes, and audio and video clips. The research often finds its way into professional journals, but the rest of the material remains squirreled away on personal computers, Web sites, and departmental servers. It’s accessible to only a few right now. And with computers and software evolving rapidly, the time is coming when files saved today will not be accessible to anyone at all.

Until recently there has been no overall plan to archive or preserve such work for posterity. But true to its problem-solving nature, MIT has come up with a solution. In September the Institute launched DSpace, a Web-based institutional repository where faculty and researchers can save their intellectual output and share it with their colleagues around the world and for centuries to come. The result of a two-year collaboration of the MIT Libraries and Hewlett-Packard, DSpace is built on open-source software and is available to anyone free of charge. More important, many believe this groundbreaking effort will fundamentally change the way scholars disseminate their research findings.

The Case for DSpace

DSpace grew out of the mutual need of MIT’s libraries and researchers to preserve digital work, says Ann Wolpert, director of libraries for the Institute. A few years before the project began, she says, “faculty started coming to the library and saying, ‘I have this stuff on my Web site. I want it to be more secure than it is on my computer. Will you figure out how to take my digitally formatted materials?’” Although the libraries have massive print archives dedicated to preserving a wide range of materials, they had no system for digital preservation. So Wolpert talked with the libraries’ faculty advisory committee and visited departments across campus to determine what was needed.

It “wasn’t but a blink of an eye before it became apparent that this kind of function was essential for educational technology as well as research,” says Wolpert. “If you’re going to spend all this money on online course content, where’s it all going to go?” During the meetings, it became clear that the convergence of faculty needs, the Institute’s commitment to OpenCourseWare, and other campus initiatives made development of a digital archive a natural fit.

The MIT Libraries submitted a proposal to Hewlett-Packard suggesting that the two organizations form a partnership to develop a multidisciplinary digital repository. The libraries’ needs were a good match with HP’s desire to develop archival storage systems that eventually could be used in any business setting.

“The world is coming to grips with the sheer magnitude of digital content that will be produced over the next decade,” says Michael Bass, the HP project manager of DSpace. “We wanted to get to the bottom of the hard-core problems that are going to keep coming up until people address them.”

The $1.8 million project became part of the five-year, $25 million MIT-HP Alliance, a research effort to develop digital information systems. In the spring of 2000, the project team of HP software developers, MIT administrators, and a faculty advisory committee started to develop the system. The result of their efforts is a one-of-a-kind repository that can store all types of digital files and is accessible from any computer on campus. Every document stored in DSpace has a unique and permanent URL. Materials submitted to the repository are organized within a community: a school, department, lab, or center. Each community sets its standards for DSpace content and decides who will be authorized to upload its documents. Posted material with unrestricted access may be viewed by anyone.

When it first came online, DSpace could store almost a terabyte of data. While that’s enough room to accommodate the information on about 1,500 CD-ROMs, it is not large enough to hold all the work MIT faculty have stored on their own hard drives and CD-ROMs. MIT plans to add storage capacity as demand increases. “We wanted to have enough storage to bootstrap an interesting body of materials,” says Bass, “but we didn’t want to overbuild.”

DSpace is not the only digital archive in the United States, but it does occupy unique ground. “If you look at the landscape of digital repositories, there seem to be two types,” says MacKenzie Smith, associate director for technology for the MIT Libraries and the Institute’s project manager for DSpace. “One concerns library holdings that happen to be in digital format. The other is a preprint archive that is tailored to scholarly papers in a discipline and is a vehicle for getting them out quickly. They are not concerned with long-term preservation.” DSpace, however, is committed to preserving not only published papers, but also their supporting documentation.

The Challenges of Being First

Because DSpace is the first superarchive of its kind, the team had many problems to solve. It wanted to create a repository that would both serve the needs of MIT and other research universities and begin to address questions about long-term data preservation, finding solutions that would be applicable in any arena. The two goals weren’t always compatible. “There’s tension between wanting the system to work for the libraries and wanting it to work in a general sense for any kind of information or knowledge industry,” says Smith. “And there’s a tension about wanting to get something out fast and wanting to take advantage of new techniques and new technologies that aren’t quite ready for prime time.”

Robert Tansley, HP’s lead software developer on the team, describes the project as “a lot of little problems that you have to solve all at the same time.” The first and most obvious problem was the variety of applications contributors use to create their submissions. “Applications change over time,” says Tansley, “so people have different versions of things, different operating systems, and a lot of them don’t talk to each other.” To address that problem, the libraries developed a way to catalog formats for which MIT has the specifications and is, therefore, able to develop software that converts files to other formats as needed.

The team also addressed the search function needs of DSpace users. The system needed to make it easy for people to find their way through the millions of documents that will end up in DSpace. The developers selected Lucene, an open-source search engine that can index so-called metadata as well as text and can be extended with additional sophisticated search capabilities. The team also puzzled over ways people from different communities could describe their documents using the conventions of their own disciplines and still provide easy access to users outside of those communities. DSpace now uses Dublin Core, an established standard for creating the metadata that describe the documents in DSpace, but the team is looking to future research for a better solution. Through another joint venture, MIT and HP will lead the way in this area of digital archiving. A three-year project will explore how to provide metadata that are customized to specific disciplines but searchable and manageable across the entire system.
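To make the idea of searchable metadata concrete, here is a minimal sketch in Python. It is not DSpace's code and it does not use Lucene: it stands in a plain in-memory index for the search engine, and the two records, their identifiers, and the function names are all invented for illustration. Only the field names (title, creator, subject, date) are borrowed from the Dublin Core element set.

```python
from collections import defaultdict

# Hypothetical records. Field names follow Dublin Core elements;
# the ids and values are made up for this example.
RECORDS = [
    {"id": "item-1", "title": "Wave loads on offshore platforms",
     "creator": "Example, A.", "subject": "ocean engineering", "date": "2002"},
    {"id": "item-2", "title": "Pricing models for technology platforms",
     "creator": "Sample, B.", "subject": "management", "date": "2001"},
]


def build_index(records):
    """Map each lowercase word in any metadata field to the ids containing it."""
    index = defaultdict(set)
    for record in records:
        for field_name, value in record.items():
            if field_name == "id":
                continue  # the identifier itself is not searchable text
            for word in value.lower().split():
                index[word.strip(".,")].add(record["id"])
    return index


def search(index, query):
    """Return the ids of records whose metadata contains every query word."""
    words = query.lower().split()
    if not words:
        return set()
    hits = [index.get(word, set()) for word in words]
    return set.intersection(*hits)


INDEX = build_index(RECORDS)
```

Because both communities describe their work with the same small set of fields, a query such as `search(INDEX, "platforms")` can reach records from ocean engineering and management alike, while `search(INDEX, "ocean platforms")` narrows to one. The open research question the article describes is how to keep that cross-community reach while letting each discipline use richer vocabulary of its own.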

There were other issues as well. The team had to develop distinct levels of authorization so that a range of access privileges could make specific materials open to the general public or restricted to the Institute, or to an even smaller group. The system needed to be flexible enough that each organizational segment of MIT could develop its own method for submitting documents. And it had to interoperate, or share content seamlessly, with other institutional archives. To make DSpace inexpensive to upgrade, Tansley divided it into exchangeable modules that can be replaced as new versions become available.
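The three tiers of access described here (the general public, the Institute, or a smaller group) can be sketched as a simple check. This is an illustrative model only, assuming one access level per item; the class names, the `can_view` function, and the sample group name are hypothetical and say nothing about how DSpace actually stores permissions.

```python
from dataclasses import dataclass
from enum import Enum


class Access(Enum):
    PUBLIC = "public"        # anyone may view
    INSTITUTE = "institute"  # Institute members only
    GROUP = "group"          # members of one named group only


@dataclass
class User:
    name: str
    is_institute_member: bool = False
    groups: frozenset = frozenset()


@dataclass
class Item:
    title: str
    access: Access
    group: str = ""  # only meaningful when access is Access.GROUP


def can_view(user, item):
    """Decide whether a user may view an item under the three-tier model."""
    if item.access is Access.PUBLIC:
        return True
    if item.access is Access.INSTITUTE:
        return user.is_institute_member
    return item.group in user.groups
```

For example, `can_view(User("visitor"), Item("Working paper", Access.PUBLIC))` is true, while the same visitor is refused an item marked `Access.INSTITUTE`.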

Just about every technical challenge presented Smith and her part of the project team with a corresponding policy question. How the material will be made available to future users was one of the biggest issues the team had to tackle. To make sure that documents will be readable on computers of the future, the team developed a list of supported formats with the requirement that the libraries will keep them available and readable in the future. For unusual formats, the libraries guarantee bit preservation, that is, storing the ones and zeroes of the original documents. “If you’ve got the know-how to reverse engineer it and maybe write a compiler for the year 2050, then you’ll be able to do something with that content,” says Smith.
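The two promises in this policy can be illustrated with a short sketch: a lookup that decides which promise applies to a file, and a checksum that confirms the stored ones and zeroes have not changed. The format list below is invented, and using a SHA-256 digest is an assumption made for the example, not a description of what the libraries do.

```python
import hashlib

# Hypothetical list of supported formats, by file extension.
SUPPORTED_FORMATS = {"pdf", "txt", "tiff"}


def preservation_level(filename):
    """Return 'supported' for listed formats, 'bit-level' for everything else."""
    extension = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    return "supported" if extension in SUPPORTED_FORMATS else "bit-level"


def fingerprint(data: bytes) -> str:
    """Compute a digest of the exact bytes, recorded when the file is deposited."""
    return hashlib.sha256(data).hexdigest()


def bits_intact(data: bytes, recorded_fingerprint: str) -> bool:
    """Check that the stored bytes still match the fingerprint taken at deposit."""
    return fingerprint(data) == recorded_fingerprint
```

Bit preservation in this sense guarantees only that the bytes survive unchanged. Whether anyone can still interpret them decades later is the separate problem Smith describes.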

Whether to allow for content removal or modification posed another policy dilemma. “In archives, you never get rid of things,” says Smith. But faculty wanted a way to suppress early, prepublication manuscripts. As a compromise, the team created a “tombstone” to acknowledge a preliminary document that did exist but is no longer available to the public. The document, however, remains in DSpace.
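A rough sketch of the tombstone compromise follows. The record layout, the handle string, and the function names are hypothetical; the point is only that withdrawing an item changes what the public sees without deleting anything from the archive.

```python
from dataclasses import dataclass


@dataclass
class ArchivedItem:
    handle: str       # hypothetical permanent identifier
    title: str
    content: bytes
    withdrawn: bool = False


def withdraw(item):
    """Mark an item as withdrawn. The content itself is never deleted."""
    item.withdrawn = True


def public_view(item):
    """Return what a public visitor sees: the item, or a tombstone in its place."""
    if item.withdrawn:
        return {
            "handle": item.handle,
            "status": "withdrawn",
            "note": "This item existed but is no longer publicly available.",
        }
    return {
        "handle": item.handle,
        "status": "available",
        "title": item.title,
        "size": len(item.content),
    }
```

After `withdraw(item)`, the permanent address still resolves, but to a notice rather than the manuscript, and `item.content` remains in storage.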

Other policy decisions covered what can be put into the online repository, what happens to the materials of a center or lab that closes, how to assign space to individual communities, and what extra services (such as scanning old papers) the library would provide.

Last April, in order to test the process and to provide feedback for improvements before the system went public in September, four representative MIT communities began to submit materials to DSpace. The early adopters were the Department of Ocean Engineering, the Center for Technology, Policy, and Industrial Development, the Laboratory for Information and Decision Systems, and the Sloan School of Management.

Don Lessard, deputy dean of the Sloan School, says the school volunteered to help test the system because “we think DSpace is going to be the key mechanism for maintaining and distributing research. We want a friendly portal into our research for academics, management professionals, and journalists.”

Though Lessard was receptive to the new system, he and other early adopters were concerned that some faculty would resist using DSpace. Some communities have been slower than others to begin using the repository, but outreach from DSpace staff and the ease of posting documents have lowered barriers to use. Early adopters also made special efforts to encourage faculty and researchers to submit to DSpace.

Reaching Out to Others

Having a digital superarchive makes MIT’s intellectual output available to anyone, anywhere, at any time, but the greatest value of DSpace may be in revolutionizing the way research is communicated and disseminated.

Even in this era of digital media, the vast majority of scholarly material at most universities goes unshared. But once DSpace is up and running, it will serve as a portal not only to MIT research, but also to research at partnering institutions. To test this possibility, MIT has entered into a federation with five other research institutions (Columbia University, Ohio State University, and the universities of Washington, Toronto, and Rochester), which will become the early adopters from outside the Institute. More than 30 other institutions have lined up to install DSpace on their campuses once the system proves itself.

The implications of such collaborations are mind-boggling. Researchers who want to stay current with their colleagues’ work will no longer have to wait for conferences or journal publications. Discussions of new ideas can flow unimpeded. James Neal, vice president for information services at Columbia, says, “DSpace gives us a vision and a well-developed strategy. It gives us a new tool for our faculty and colleagues to communicate around the world.”

At Cornell University, Robert Cooke, dean of the faculty, says that his school will bring DSpace to campus by the end of this year. “I think the archival function of DSpace will be wildly successful,” he says. “Faculty have collected huge amounts of data that are not publicly accessible. We are genuinely impressed with DSpace. They’ve done things right.”

DSpace is also hailed as the beginning of a shift on the scholarly publishing front. Many journals refuse to publish important work that has appeared in any public arena, including an institutional repository or a Web site. But the “general trend is moving toward allowing a copy to be online on a Web site or in an archive,” says Margret Branschofsky, the faculty liaison on the DSpace project team. MIT project team members and the advocacy group Scholarly Publishing and Academic Resources Coalition, as well as representatives at many institutions, hope that changes will come to the traditional publishing system as digital archives proliferate. “We don’t think of repositories replacing journals,” says Rick Johnson, enterprise director of the coalition. “In the near term, they are complements.” But there is no doubt they will compete as well. Copyright agreements offered by journals will have to change, allowing faculty to retain the right to archive their papers in institutional repositories.

Having met the challenges of creating a digital archive, MIT and HP will continue to improve DSpace through their new metadata project and the federation with other institutions.

For Cornell’s Cooke, MIT’s attention to the needs of its faculty, the open architecture of DSpace, the federation design, and its decentralized nature comprise “a genius stroke.” DSpace, he says, is “destined to fundamentally reshape and enhance the way research universities and their faculties function.”
