Taming the Data Tsunami
One of the claimants to the title of the world’s largest database sits on the edge of the Stanford University campus, connected to a three-kilometer-long particle accelerator. Housing records of the millions upon millions of elementary-particle interactions that occur in the accelerator, the BaBar database, as it is known, contains more than 680 terabytes of information, equivalent to a stack of copies of the Bill of Rights some 21,000 kilometers high. (A terabyte is 10^12 bytes.) From a data-gathering viewpoint, the Stanford experiment is a nightmare. The accelerator smashes electrons and positrons into each other at almost the speed of light, creating an explosion of data in a few trillionths of a second: vastly more input than any computer network can handle. To make sense of these overwhelming infobursts, BaBar engineers have developed a variety of techniques for containing the flow of data. These techniques will almost certainly be used by the enormous surveillance archives of tomorrow, suggesting both how they will function and how, just possibly, they might be regulated for the public good.
Rather than trying to absorb the entire river of readings from the particle collisions, the sensors in the BaBar particle detector record just a few specific aspects of selected events, discarding millions of data points for every one kept. That small sip of raw data, about a gigabyte every few minutes of running time, is still too much for physicists to study, says Jacek Becla, the lead designer of the database. To further distill the observations, the detector’s software intensively “preprocesses” the selected measurements, reducing each to a relative handful of carefully checked, easily manipulable numbers before incorporating them into the database.
Even after preprocessing, a data set can still be too big to examine efficiently in a single central locality. As a result, large databases often divide their work into smaller pieces and distribute the resulting tasks among hundreds or thousands of machines around a network. Many of these techniques were first implemented on a large scale by SETI@Home, a massively distributed system that hunts for alien civilizations. SETI@Home takes in radio telescope readings, breaks the data into chunks, and uses the Internet to dole them out to the home computers of more than four million volunteers. When these computers are otherwise idle, they run a screensaver-like program that probes the data for signs of sentient life.
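The divide-and-distribute idea can be sketched in a few lines of Python. This is only an illustration of the general pattern, not SETI@Home's actual software: the function names, the chunk size, and the volunteer names are all hypothetical, and a real system would send the chunks over a network rather than keep them in one list.

```python
from itertools import cycle

def split_into_chunks(readings, chunk_size):
    """Break a long list of readings into fixed-size work units."""
    return [readings[i:i + chunk_size]
            for i in range(0, len(readings), chunk_size)]

def assign_chunks(chunks, volunteers):
    """Hand out chunks round-robin and record who holds which one."""
    assignments = {}
    for chunk_id, volunteer in zip(range(len(chunks)), cycle(volunteers)):
        assignments[chunk_id] = volunteer
    return assignments

# Hypothetical stand-in for a stream of radio telescope samples.
readings = list(range(10))
chunks = split_into_chunks(readings, 4)
assignments = assign_chunks(chunks, ["alice-pc", "bob-pc"])
```

Keeping the assignment table is what lets the coordinator know which machine is responsible for which piece of data.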
As the extraordinary measures taken by BaBar and SETI@Home suggest, large databases face inherent problems. Simply running the routine comparisons that are intrinsic to databases takes much longer as data become more complex, says Piotr Indyk, a database researcher at MIT. Worse, he says, the results are often useless: as the data pool swells, the number of chance correlations rises even faster, swamping meaningful answers in a tsunami of logically valid but utterly useless solutions. Without preprocessing and distributed computing, the surveillance databases of tomorrow will drown in their own input.
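A little arithmetic shows why chance correlations can outpace the data. The sketch below rests on two assumptions that are not in Indyk's remarks: that every pair of variables is compared once, and that each comparison has a 1 percent chance of looking significant by accident. Under those assumptions, the number of comparisons grows roughly with the square of the number of variables.

```python
def pairwise_comparisons(n_variables):
    """Number of distinct pairs among n variables: n(n-1)/2."""
    return n_variables * (n_variables - 1) // 2

def expected_chance_hits(n_variables, false_positive_rate):
    """Expected number of pairs that look correlated purely by accident."""
    return pairwise_comparisons(n_variables) * false_positive_rate

# Assumed 1 percent accidental-match rate, for illustration only.
for n in (10, 100, 1000):
    print(n, pairwise_comparisons(n), expected_chance_hits(n, 0.01))
```

In this toy model, a hundredfold increase in variables, from 10 to 1,000, raises the number of pairs from 45 to 499,500, and the expected count of accidental matches rises in the same proportion.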
It is, perhaps, unexpected that both preprocessing and distributed computing also exemplify ways the structure of databases might provide levers to control their use, if people want them. For privacy advocates, surveillance raises two critical issues: lack of accountability and the specter of information collected for a benign purpose being used for another, perhaps sinister, end. “Time and time again, people have misused this kind of data,” says Peter G. Neumann, a computer scientist at SRI, a nonprofit research organization in Menlo Park, CA. To discover when users have overstepped or abused their privileges, he says, “accountability as to who is accessing what, altering what data, not updating stuff that should have been corrected, et cetera, is absolutely vital.”
Such monitoring is already standard operating procedure in many large databases. SETI@Home, for instance, tracks exactly which of its millions of member computers is examining which datum, not least because the system, according to Berkeley computer scientist David Anderson, its designer, sends dummy data to users during the 10 to 15 percent of the time it is down, and therefore needs to monitor what is real. Nonetheless, Neumann says, most commercial database programs don’t securely record the usage data they collect. With off-the-shelf database software from Oracle, IBM, and Microsoft, he says, “there is no way” that such large surveillance databases as the Terrorist Threat Integration Center “could get accountability in any meaningful sense.” The software simply allows for too many “trusted users”: people who have full access to the system and can modify audit trails, thus deleting their tracks from the logs. The possibility of meaningful accountability does exist, but people must demand it.
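One well-known way to make an audit trail harder to doctor is to chain its entries together with cryptographic hashes, so that altering an old entry invalidates it and everything after it. The sketch below uses Python's standard hashlib and json modules. It is an illustration of that general technique, not something Neumann describes or any vendor's actual mechanism, and the user names and actions are invented.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash that precedes the first entry

def _digest(prev_hash, record):
    """Hash a record together with the hash of the entry before it."""
    payload = json.dumps(record, sort_keys=True) + prev_hash
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def append_entry(log, user, action):
    """Add a log entry whose hash depends on the whole prior history."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    record = {"user": user, "action": action}
    log.append({"record": record, "hash": _digest(prev_hash, record)})

def verify(log):
    """Recompute the chain; any edited entry breaks it."""
    prev_hash = GENESIS
    for entry in log:
        if entry["hash"] != _digest(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True

log = []
append_entry(log, "analyst7", "read plate records")   # hypothetical entries
append_entry(log, "admin", "exported images")
ok_before = verify(log)

log[0]["record"]["user"] = "someone-else"  # a "trusted user" edits history
ok_after = verify(log)
```

A chain like this only detects tampering if the most recent hash is also kept somewhere the trusted user cannot reach; otherwise the whole chain could simply be recomputed.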
Similar logic applies to the fear that data collected for one purpose will be misused for another. Consider, for example, the program in London, England, that levies a £5 ($8) “congestion charge” on each vehicle crossing into the central city. To enforce collection, the city uses hundreds of digital video cameras and character recognition software to read the license plate of every vehicle crossing into the fee area. Plate numbers are matched against the list of drivers who have paid up; noncompliant vehicle owners receive summonses in the mail. Just before the program’s launch, newspapers revealed that the images would be given to police and military databases, which would use face recognition software to scan for criminals and terrorists: an example of what privacy activists decry as “feature creep.” Observes Marc Rotenberg, executive director of the Electronic Privacy Information Center in Washington, DC, “They say they’re taking your picture to stop traffic jams. Then all of a sudden they’re trying to find out if you’re a terrorist.”
As all this suggests, repurposing surveillance information is subject to so many pitfalls that “we need to build restrictions on the way data are used,” says Lawrence Lessig, a Stanford University law professor who is the author of Code and Other Laws of Cyberspace. Ideally, in Lessig’s view, “you’d want to have a situation like what goes on with credit reports-we can see them, and know something about who is using them and why, and potentially remove any errors.”
The technology to provide such protections is already emerging. The Malaysian government is rolling out a multifunction smart card with 32 kilobytes of memory that can store up to seven types of data, including details about a person’s identity, driver’s license, bank account, and immigration status. Embedded software encrypts and compartmentalizes the information and keys it to the cardholder’s biometric data, ensuring that when an authorized government or business official accesses one type of data, the other types remain off-limits (see “A Smart Way to Protect Privacy,” p. 1). If introduced into the United States, such cards could be set to tell bartenders that their bearers “are over 21 and can drink alcohol; but that’s all,” explains Lessig. “And if a police officer stops you, the card should only tell her that you have a valid driver’s license,” and not, say, your Social Security number.
The same kinds of access controls should be applied to large, centralized databases, Lessig believes. Users logging onto a sensitive database should have to identify themselves, and their access should be restricted solely to data they are authorized to examine. To further deter misuse, the database should preserve a record of its users and their actions. Such precautions are not only technically feasible but, to Lessig’s way of thinking, simply good policy. Still, he sees “next to no chance” that such precautions will be implemented, because terrorist attacks have changed the government’s attitude toward privacy and because ordinary people have demonstrated their willingness to embrace the technology without understanding the consequences.