Managing the sheer size of these aggregate surveillance databases, surprisingly, will not pose insurmountable technical difficulties. Most personal data are either very compact or easily compressible. Financial, medical, and shopping records can be represented as strings of text that are easily stored and transmitted; as a general rule, the records do not grow substantially over time.
Even biometric records are no strain on computing systems. To identify people, genetic-testing firms typically need stretches of DNA that can be represented in just one kilobyte, about the size of a short e-mail message. Fingerprints, iris scans, and other types of biometric data consume little more. Other forms of data can be preprocessed in much the way that the cameras on Route 9 transform multi-megabyte images of cars into short strings of text with license plate numbers and times. (For investigators, having a video of suspects driving down a road usually is not as important as simply knowing that they were there at a given time.) To create a digital dossier for every individual in the United States, as programs like Total Information Awareness would require, only “a couple terabytes of well-defined information” would be needed, says Jeffrey Ullman, a former Stanford University database researcher. “I don’t think that’s really stressing the capacity of [even today’s] databases.”
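To get a feel for what "a couple terabytes" means per person, the figure can be divided by the number of people covered. The sketch below is a back-of-envelope check only: the population value of roughly 290 million and the reading of "a couple" as exactly two decimal terabytes are assumptions made for illustration, not numbers from Ullman.

```python
# Back-of-envelope check of the "couple terabytes" estimate.
# ASSUMPTION: about 290 million U.S. residents (illustrative, not from the article).
US_POPULATION = 290_000_000
# ASSUMPTION: "a couple terabytes" read as 2 TB, with 1 TB = 10**12 bytes.
TOTAL_BYTES = 2 * 10**12

bytes_per_person = TOTAL_BYTES / US_POPULATION
kilobytes_per_person = bytes_per_person / 1000

print(f"{kilobytes_per_person:.1f} KB per person")  # prints "6.9 KB per person"
```

Under these assumptions each dossier gets a budget of several kilobytes, which is in line with the one-kilobyte DNA records and short text strings described above. Storage, in other words, is unlikely to be the bottleneck.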
Instead, argues Rajeev Motwani, another member of Stanford’s database group, the real challenge for large surveillance databases will be the seemingly simple task of gathering valid data. Computer scientists use the term GIGO (garbage in, garbage out) to describe situations in which erroneous input creates erroneous output. Whether people are building bombs or buying bagels, governments and corporations try to predict their behavior by integrating data from sources as disparate as electronic toll-collection sensors, library records, restaurant credit-card receipts, and grocery store customer cards, to say nothing of the Internet, surely the world’s largest repository of personal information. Unfortunately, all these sources are full of errors, as are financial and medical records. Names are misspelled and digits transposed; address and e-mail records become outdated when people move and switch Internet service providers; and formatting differences among databases cause information loss and distortion when they are merged. “It is routine to find in large customer databases defective records (records with at least one major error or omission) at rates of at least 20 to 35 percent,” says Larry English of Information Impact, a database consulting company in Brentwood, TN.
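A small sketch can show how a misspelling combined with a formatting difference defeats a straightforward merge. The two records, the `normalize` and `similarity` helpers, and the choice of Python's standard `difflib` module are all illustrative assumptions; nothing here describes a method used by the researchers or companies quoted above.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase, strip punctuation, and turn 'Last, First' into 'first last'."""
    if "," in name:
        last, first = name.split(",", 1)
        name = f"{first} {last}"
    cleaned = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(cleaned.split())


def similarity(a: str, b: str) -> float:
    """Similarity ratio between 0.0 and 1.0 for two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


# Hypothetical records for one person as stored in two different databases.
toll_record = "Smith, Jonathan"   # 'Last, First' layout
card_record = "Jonathon Smith"    # 'First Last' layout, with a misspelling

exact_match = toll_record == card_record        # False: a naive join misses the link
score = similarity(toll_record, card_record)    # high, but below 1.0

print(exact_match, round(score, 2))
```

An exact comparison treats these as two different people, while the normalized comparison scores them as nearly identical. The catch is that any cutoff loose enough to link this pair can also link two genuinely different people with similar names, so choosing one remains a judgment call.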
Unfortunately, says Motwani, “data cleaning is a major open problem in the research community. We are still struggling to get a formal technical definition of the problem.” Even when the original data are correct, he argues, merging them can introduce errors where none had existed before. Worse, none of these worries about the garbage going into the system even begin to address the still larger problems with the garbage going out.