A Time to Share
It’s one thing to advocate widespread data sharing; achieving it is another. Even with access to a secure network, for example, local police and other municipal officials may not have enough storage capacity or processing power to make good use of the data. But such investigators could tap into the network itself for the resources they need. During the past few years, researchers have made rapid progress in “distributed computing,” which uses high-speed Internet connections to link many computers so they work together as one machine. Grid computing, for instance, links mainframes together, and peer-to-peer systems similarly unite personal computers. For all users, even those whose own computers are terribly underpowered, the bottom line is much the same: so long as they can plug into the network, they will be able to access data files, analytical tools, and raw computational power as easily as if they were sitting at the world’s greatest supercomputer. In a sense, they will be.

Provision of ample processing power will not completely alleviate the problems of information sharing, however. First, there’s a technical issue: old databases (and even many new ones) store information in myriad incompatible formats. Then there’s a policy issue: a long list of organizations control critical databases, and the overseers of many of them believe that their files contain information too sensitive to release to outsiders.
The trick is to share the information anyway. Techniques to make this feasible are under development. Innovation is especially active on systems for “structured” data: the kind in which names and numbers fit neatly into rows and columns, as in a spreadsheet or an airline reservations database. One approach, advocated most notably by database giant Oracle, is to convert everything into a standard format and store it all in a common data warehouse. This approach has the virtue of simplicity, and because it allows for efficient searching, it is widely used for commercial data mining, the systematic processing of data to find useful pieces of information. Indeed, it is the approach Systems Research and Development uses for its system, which can monitor thousands of data sources in real time.
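The warehouse idea can be sketched in a few lines of Python. Everything here is invented for illustration, including the source names, field names, and records; the point is only that once heterogeneous feeds are mapped onto one shared schema, a single query can scan all of them at once.

```python
# A minimal sketch of the "common data warehouse" approach: records
# arriving in different source-specific formats are mapped onto one
# standard schema before being stored and searched. All sources,
# fields, and data below are hypothetical.

def to_standard(record, source):
    """Map a source-specific record onto the shared schema."""
    if source == "airline":
        return {"name": record["passenger"], "address": record["addr"]}
    if source == "retail":
        return {"name": record["customer_name"], "address": record["home_address"]}
    raise ValueError(f"unknown source: {source}")

warehouse = []
warehouse.append(to_standard(
    {"passenger": "Robert Smith", "addr": "12 Oak St"}, "airline"))
warehouse.append(to_standard(
    {"customer_name": "R. Smith", "home_address": "12 Oak St"}, "retail"))

# Once everything is in one format, one query spans every source.
matches = [r for r in warehouse if r["address"] == "12 Oak St"]
```

The simplicity the article mentions is visible here: the hard work is confined to the per-source translation step, after which searching is uniform.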
“Our largest installation to date was for a large corporation that wanted to aggregate data on customers and has 4,200 daily feeds,” Jonas says. That’s about nine million transactions per day. As each transaction occurs, the data it generates are translated into a standard format, typically one based on Extensible Markup Language (XML), an information-rich variant of the hypertext markup language used to code most Web pages. The proprietary software further refines the data using a process that tries to determine whether records for, say, “Bob Smith,” “Rob Smith,” and “Robert Smith” represent three people, two, or one. “Often the differences are unintentional,” says Jonas. “But sometimes it’s people trying to obfuscate who they are.” Finally, he says, the algorithm looks for correlations that might relate one person to someone on a terrorist watch list. Results are generated in about three seconds. “Imagine that you have this ocean of data,” he says. “When each new drop comes in, we can see all the ripples.”
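The name-resolution step Jonas describes is a form of what is generally called entity resolution. The sketch below is not SRD’s proprietary algorithm, just one common approach: normalize known nicknames to a canonical form, then group records whose normalized names agree. All names and phone numbers are invented.

```python
# A toy entity-resolution pass: collapse name variants ("Bob", "Rob",
# "Robert") to a canonical form and group records under one candidate
# identity. This illustrates the general technique only, not the
# proprietary software described in the article.

NICKNAMES = {"bob": "robert", "rob": "robert", "bobby": "robert"}

def canonical(name):
    """Lowercase a 'First Last' name and expand known nicknames."""
    first, last = name.lower().split()
    return f"{NICKNAMES.get(first, first)} {last}"

records = [
    {"name": "Bob Smith", "phone": "555-0100"},
    {"name": "Rob Smith", "phone": "555-0100"},
    {"name": "Robert Smith", "phone": "555-0199"},
]

identities = {}
for r in records:
    identities.setdefault(canonical(r["name"]), []).append(r)

# All three variants collapse to one candidate identity; whether they
# are truly one person would require further evidence, such as shared
# addresses or phone numbers.
```

A real system would weigh many more signals (addresses, dates of birth, typographic distance) before merging records, precisely because, as Jonas notes, some differences are deliberate obfuscation.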
For one client, a large U.S. retailer facing nearly $40 million per year in fraud-related losses, Jonas’s software examined the records of the company’s thousands of employees and identified 564 who had relationships either with a company vendor (a possible source of illegal collusion) or with a person the retailer had previously had arrested. And in a trial conducted for a retail association in Washington, the system turned up evidence of a ring of underage thieves who had been shoplifting a popular brand of jeans.
The company now has installations running at the FBI and several other government agencies, Jonas says; most of that deployment is new since September 11. But if one of these systems had been monitoring the appropriate data feeds in late August 2001, he says, it might have noticed that two men on the terrorist watch list of the U.S. Department of State had purchased, using their real names and addresses, airline tickets for American Airlines flight 77, which hijackers would crash into the Pentagon. Noting that coincidence, the system might have checked whether anyone else was using the addresses, frequent-flier information, and phone numbers the two men had provided.
Continuing in this way, the system might have created a chain of associations leading to many of the other 19 hijackers, including ringleader Mohammad Atta. Without the advantage of hindsight, of course, such a chain would not necessarily have screamed, Terrorists! But it might have put analysts on alert. And with several days remaining before the scheduled flight, they might have turned up other links to puzzling information not available online, such as a common interest in flight schools. Maybe, just maybe, agents might have stopped the conspirators at the gate on the morning of September 11.
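The chain-of-associations idea described above can be modeled as a graph walk: treat people and the attributes they use (addresses, phone numbers, frequent-flier accounts) as nodes, link each person to each attribute, and search outward from a watch-list hit. The sketch below uses a simple breadth-first search; all names and data are invented.

```python
# A sketch of association chaining: people and shared attributes form
# a graph, and a breadth-first walk from one watch-list hit surfaces
# everyone reachable through shared attributes. All data is invented.

from collections import deque

links = [
    ("person_A", "addr_1"), ("person_A", "phone_1"),
    ("person_B", "addr_1"),       # shares an address with A
    ("person_B", "ff_acct_2"),
    ("person_C", "ff_acct_2"),    # shares a frequent-flier account with B
]

graph = {}
for person, attr in links:
    graph.setdefault(person, set()).add(attr)
    graph.setdefault(attr, set()).add(person)

def associates(start):
    """Breadth-first walk returning every person reachable from start."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return {n for n in seen if n.startswith("person")}

# Starting from one watch-list hit (person_A), the walk surfaces
# person_B via the shared address, then person_C via the shared
# frequent-flier account -- a two-hop chain of associations.
```

As the article cautions, such a chain is only suggestive: shared addresses and accounts link innocent people too, which is why the output would put analysts on alert rather than trigger arrests.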
Cracking the Case: A dirty bomb is set to go off in Times Square on New Year’s Eve. Plotters leave clues scattered around the world.