Managing the sheer size of these
aggregate surveillance databases, surprisingly, will not pose insurmountable
technical difficulties. Most personal data
are either very compact or easily compressible.
Financial, medical, and shopping records can be represented as strings
of text that are easily stored and transmitted; as a general rule, the records do
not grow substantially over time.
Even biometric records are no strain
on computing systems. To identify people,
genetic-testing firms typically need
stretches of DNA that can be represented
in just one kilobyte, the size of a short e-mail
message. Fingerprints, iris scans,
and other types of biometric data consume little more. Other forms of data
can be preprocessed in much the way
that the cameras on Route 9 transform
multi-megabyte images of cars into short
strings of text with license plate numbers
and times. (For investigators, having a
video of suspects driving down a road
usually is not as important as simply
knowing that they were there at a given
time.) To create a digital dossier for every
individual in the United States, as programs
like Total Information Awareness would require, only "a couple terabytes
of well-defined information" would be
needed, says Jeffrey Ullman, a former
Stanford University database researcher.
"I don't think that's really stressing the
capacity of [even today's] databases."
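
The storage figures above can be checked with a quick back-of-envelope calculation. The sketch below is illustrative only: the population figure of about 290 million and the use of decimal units (1 TB = 10^12 bytes) are assumptions, not numbers from Ullman, and the function names are hypothetical.

```python
# Back-of-envelope check of the "couple terabytes" estimate.
# Assumptions (not from the article): a U.S. population of about
# 290 million and decimal units (1 TB = 10**12 bytes).
US_POPULATION = 290_000_000
TERABYTE = 10**12
KILOBYTE = 10**3


def bytes_per_person(total_terabytes: float, population: int = US_POPULATION) -> float:
    """Average storage available per individual dossier, in bytes."""
    return total_terabytes * TERABYTE / population


def dna_store_terabytes(population: int = US_POPULATION, kb_per_profile: float = 1.0) -> float:
    """Total size of one identifying DNA profile per person, in terabytes."""
    return population * kb_per_profile * KILOBYTE / TERABYTE


if __name__ == "__main__":
    print(f"{bytes_per_person(2) / KILOBYTE:.1f} KB per person in 2 TB")
    print(f"{dna_store_terabytes():.2f} TB for a 1 KB DNA profile per person")
```

Under these assumptions, two terabytes works out to roughly 7 KB of text per person, and a one-kilobyte DNA profile for every individual would occupy about 0.29 TB.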
Instead, argues Rajeev Motwani,
another member of Stanford's database
group, the real challenge for large surveillance databases will be the seemingly
simple task of gathering valid data. Computer scientists use the term GIGO—
garbage in, garbage out—to describe
situations in which erroneous input creates erroneous output. Whether people are
building bombs or buying bagels, governments and corporations try to predict their behavior by integrating data
from sources as disparate as electronic
toll-collection sensors, library records,
restaurant credit-card receipts, and grocery store customer cards—to say nothing
of the Internet, surely the world's largest
repository of personal information.
Unfortunately, all these sources are full of
errors, as are financial and medical
records. Names are misspelled and digits
transposed; address and e-mail records
become outdated when people move and
switch Internet service providers; and
formatting differences among databases
cause information loss and distortion
when they are merged. "It is routine to
find in large customer databases defective
records—records with at least one major
error or omission—at rates of at least 20
to 35 percent," says Larry English of Information Impact, a database consulting
company in Brentwood, TN.
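
A small sketch shows how formatting differences and transposed digits can defeat a merge. The records, field names, and helper functions below are invented for illustration; they do not describe any real database or any method used by the people quoted here.

```python
# Hypothetical records: the same two people as they might appear in two
# databases that follow different formatting conventions.
bank = [
    {"name": "Smith, John A.", "phone": "(615) 555-0142"},
    {"name": "Garcia, Maria", "phone": "(615) 555-0187"},
]
grocery = [
    {"name": "JOHN SMITH", "phone": "615-555-0142"},
    {"name": "Maria Garcia", "phone": "615-555-0178"},  # last digits transposed
]


def digits_only(phone: str) -> str:
    """Normalize a phone number by keeping only its digits."""
    return "".join(ch for ch in phone if ch.isdigit())


def match(left, right, key):
    """Pair up records from two lists whose computed keys are equal."""
    index = {key(r): r for r in right}
    return [(l, index[key(l)]) for l in left if key(l) in index]


# Matching on the raw strings finds nothing, because the formats differ.
raw_matches = match(bank, grocery, key=lambda r: r["phone"])

# Matching on normalized digits recovers one person; the transposed
# digits in the other record still prevent an exact match.
clean_matches = match(bank, grocery, key=lambda r: digits_only(r["phone"]))
```

Here the raw comparison links neither person, and normalization links only one of the two. The remaining mismatch comes from an error in the data itself, which no amount of reformatting can repair.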
Unfortunately, says Motwani, "data
cleaning is a major open problem in the
research community. We are still struggling to get a formal technical definition
of the problem." Even when the original
data are correct, he argues, merging them
can introduce errors where none had
existed before. Worse, none of these worries about the garbage going into the system even begin to address the still larger problems with the garbage going out.