A few years ago, says Jeff Jonas, a friend arranged for him to give a talk at the secretive National Security Agency, widely renowned as the most technology-savvy spy shop in the world. He wasn’t quite sure what to expect. “I had never even set foot in Washington,” says Jonas, founder and chief scientist of Systems Research and Development, a Las Vegas maker of custom software that was being used by casinos and other companies to screen employees and prevent theft. True, Jonas was proud of NORA, his company’s Non-Obvious Relationships Awareness analytic software. The system can cross-correlate millions of transactions per day, extracting such items of interest as the info nugget that a particular applicant for a casino job has a sister who shares a telephone number with a known underworld figure. But Jonas reckoned that this would seem like routine stuff to the wizards of the NSA.
Wrong. “I was shocked,” Jonas says. After his talk, several members of the audience told him that his technology was more sophisticated than anything the NSA had. And now Systems Research and Development has several government customers. Indeed, he says, “since September 11, the urgency has really peaked.”
But maybe Jonas shouldn’t have been shocked. There are many explanations for the failure of the U.S. Central Intelligence Agency, the Federal Bureau of Investigation, and their fellow intelligence agencies to “connect the dots” in time to stop the terrorist attacks. The list of reasons could start with the well-known inability of these organizations to communicate. But their analysts’ out-of-date tool kit surely didn’t help. Over the past decade, the business market has seen extraordinary advances in data mining, information visualization, and many other tools for “sensemaking,” a broad-brush term that covers all the ways people bring meaning to the huge volumes of data that flood the modern world. And yet, in a major study released last October, the Markle Foundation’s Task Force on National Security in the Information Age emphasized that “we have not yet begun to mobilize our society’s strengths in information, intelligence, and technology.”
That’s not quite fair. The mobilization has begun, albeit in piecemeal, internecine fashion. Individual agencies have been eager customers for the new technologies for several years. And since 1999 the CIA has been funding some of the most promising sensemaking companies (including Jonas’s) through In-Q-Tel, the agency’s own Arlington, VA-based venture capital firm. What’s more, in early 2002 the U.S. Defense Advanced Research Projects Agency upped the ante by systematically developing sensemaking technology through its controversial new Information Awareness Office. But the problem, says the Markle task force, is that because each of the agencies is so intent on obtaining its own intelligence and buying its own technology, there has been no overall planning or coordination. Nor has a significant fraction of the annual $38 billion budgeted for homeland defense been devoted to building a capacity for sharing information or integrating its analysis.
That won’t do. The era inaugurated with such fury by the assault of September 11 imposes a technological imperative: put the pieces of the data gathering and analysis machinery together. We must mobilize the nation’s strengths in networking and analytical technology to create what the Markle task force calls a virtual analytic community: a 21st-century intelligence apparatus that would encompass not just the agencies in Washington, but also private-sector experts, local officials, and even ordinary citizens. The cold war was a mainframe-versus-mainframe confrontation, but the war against terrorism pits the United States against a network. It’s time to take intelligence gathering and interpretation into the network age.
Collecting the Dots
“We’re entering a new era of knowledge management,” declares In-Q-Tel CEO Gilman Louie. “Call it the era of chaos or complexity or whatever trendy term you want. We’re looking at a level of integration that has never been contemplated before in government.”
It’s too soon to know how (or whether) officials in Washington will approach this challenge. Any effort will most likely be spearheaded by the U.S. Department of Homeland Security, which is still in the process of organizing itself. Nevertheless, it certainly is not too soon to take a look at how such a virtual intelligence system might work and how it might use technology most effectively.
Before we can connect the dots, we first have to collect the dots. And in the war on terrorism, notes the Markle report, “most of the people, information, and action will be in the field.” Critical clues can come from unexpected places: “a cop hearing a complaint from a landlord, an airport official who hears about a plane some pilot trainee left on a runway, an FBI agent puzzled by an odd flight-school student in Arizona, or an emergency room resident trying to treat patients stricken by an unusual illness.”
Likewise, in most cases, most of the expertise required to interpret particular pieces of intelligence or to devise responses will reside with local officials and other agencies outside the centralized-intelligence community. If, for example, a town is facing a threat to its water system, an appropriate response team might include state officials and local hospitals, as well as public utility commissioners, building inspectors, and watershed conservationists. So the virtual community’s most basic requirement will be an online meeting space that’s open to any and all such officials. Such a system could be implemented as a virtual private network running on top of the Internet, with standard encryption techniques providing the security. Many companies have been operating virtual private networks for several years.
But simple communication is only part of what’s needed. Investigators from myriad federal, state, and local agencies will also have to share data quite freely, if only because there’s no way to know in advance what information will be relevant to whom. This commonsensical notion is not the norm in conventional intelligence agencies, where to protect sources and methods, information is kept tightly compartmentalized. Yet information sharing among employees is increasingly common in the corporate world; open communication helps companies become much more flexible, innovative, and responsive to customers. And it is even taking root in the Pentagon, where information sharing is known as network-centric warfare. A prime example: the Afghanistan war, during which everyone involved (imagery analysts, fighter pilots, and experts on Afghanistan itself) had access to the same data and could interact in real time.
A Time to Share
It’s one thing to advocate widespread data sharing; achieving it is another. Even with access to a secure network, for example, local police officials and other municipal officials may not have enough storage capacity or processing power to make good use of the data. But maybe such investigators could tap into the network itself for the resources they need. During the past few years, researchers have made rapid progress in “distributed computing,” which uses high-speed Internet connections to link many computers so they work together as one machine. Grid computing, for instance, links mainframes together, and peer-to-peer systems similarly unite personal computers. But for all users, even if their own computers are terribly underpowered, the bottom line is much the same: so long as they can plug into the network, they will be able to access data files, analytical tools, and raw computational power as easily as if they were sitting at the world’s greatest supercomputer. In a sense, they will be.
Provision of ample processing power will not completely alleviate the problems of information sharing, however. First, there’s a technical issue: old databases (and even many new ones) store information in myriad incompatible formats. Then there’s a policy issue: a long list of organizations control critical databases, and the overseers of many of them believe that their files contain information too sensitive to release to outsiders.
The trick is to share the information anyway. Techniques to make this feasible are under development. Innovation is especially active on systems for “structured” data: the kind in which names and numbers fit neatly in rows and columns, as in a spreadsheet or airline reservations database. One approach, advocated most notably by database giant Oracle, is to convert everything into a standard format and store it all in a common data warehouse. This approach has the virtue of simplicity, and because it allows for efficient searching, it is widely used for commercial data mining, the systematic processing of data to find useful pieces of information. Indeed, it is the approach Systems Research and Development uses for its system, which can monitor thousands of data sources in real time.
“Our largest installation to date was for a large corporation that wanted to aggregate data on customers and has 4,200 daily feeds,” Jonas says. That’s about nine million transactions per day. As each transaction occurs, the data it generates are translated into a standard format, typically one based on extensible markup language (XML), an information-rich relative of the hypertext markup language (HTML) used to code most Web pages. The proprietary software further refines the data using a process that tries to determine whether records for, say, “Bob Smith,” “Rob Smith,” and “Robert Smith” represent three people, two, or one. “Often the differences are unintentional,” says Jonas. “But sometimes it’s people trying to obfuscate who they are.” Finally, he says, the algorithm looks for correlations that might relate one person to someone on a terrorist watch list. Results are generated in about three seconds. “Imagine that you have this ocean of data,” he says. “When each new drop comes in, we can see all the ripples.”
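The refinement step Jonas describes can be pictured with a toy example. The sketch below is not the company’s proprietary method; it simply assumes that two records describe one person when they share a phone number and their first names reduce to the same formal name. The nickname table, the field names, and the matching rule are all hypothetical.

```python
from collections import defaultdict

# Hypothetical nickname table; a real system would need a far larger one.
NICKNAMES = {"bob": "robert", "rob": "robert", "bill": "william"}


def normalize_name(name):
    """Lowercase the name and replace a known nickname with its formal form."""
    first, *rest = name.lower().split()
    return " ".join([NICKNAMES.get(first, first), *rest])


def resolve(records):
    """Group records that share both a normalized name and a phone number."""
    groups = defaultdict(list)
    for rec in records:
        groups[(normalize_name(rec["name"]), rec["phone"])].append(rec)
    return list(groups.values())


records = [
    {"name": "Bob Smith", "phone": "555-0101"},
    {"name": "Rob Smith", "phone": "555-0101"},
    {"name": "Robert Smith", "phone": "555-0199"},
]
entities = resolve(records)  # two entities under this (assumed) rule
```

Under this rule “Bob” and “Rob” collapse into one person, while the “Robert Smith” with a different phone number stays separate. Any real deployment would weigh many more attributes before deciding.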
For one client, a large U.S. retailer facing nearly $40 million per year in fraud-related losses, Jonas’s software examined the records of the company’s thousands of employees and identified 564 who had relationships with either a company vendor (a possible source of illegal collusion) or a person the retailer had previously had arrested. And in a trial conducted for a retail association in Washington, the system turned up evidence of a ring of underage thieves who had been shoplifting a popular brand of jeans.
The company now has installations running at the FBI and several other government agencies, Jonas says. Most of that deployment is new since September 11. But if one of these had been monitoring the appropriate data feeds in late August 2001, he says, it might have noticed that using their real names and addresses, two men on the terrorist watch list of the U.S. Department of State had purchased airline tickets for American Airlines flight 77, which hijackers would crash into the Pentagon. Noting that coincidence, the system might have checked to see whether anyone else was using the addresses, frequent-flier information, and phone numbers provided by the two men.
Continuing in this way, the system might have created a chain of associations leading to many of the 19 hijackers, including ringleader Mohammad Atta. Without the advantage of hindsight, of course, such a chain would not necessarily have screamed, “Terrorists!” But it might have put analysts on alert. And, with several days remaining before the scheduled flight, they might have turned up other links, including puzzling information not available online, such as a common interest in flight schools. Maybe, just maybe, agents might have stopped the conspirators at the gate on the morning of September 11.
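The chain of associations described above amounts to walking a graph in which two people are linked whenever they share an address, a phone number, or a frequent-flier number. Here is a minimal sketch of that idea, using a breadth-first search over invented booking data. The function names and data layout are assumptions for illustration, not a description of how NORA works.

```python
from collections import defaultdict, deque


def build_links(bookings):
    """Link any two people who share at least one attribute value."""
    by_value = defaultdict(set)
    for person, attrs in bookings.items():
        for value in attrs:
            by_value[value].add(person)
    links = defaultdict(set)
    for people in by_value.values():
        for person in people:
            links[person] |= people - {person}
    return links


def chain_from(seeds, links):
    """Walk outward from the seed names, recording how many hops away each person is."""
    hops = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        current = queue.popleft()
        for neighbor in links[current]:
            if neighbor not in hops:
                hops[neighbor] = hops[current] + 1
                queue.append(neighbor)
    return hops


# Invented data: A is on a watch list; B shares A's address; C shares B's flier number.
bookings = {
    "A": {"addr:1 Elm St", "phone:555-0101"},
    "B": {"addr:1 Elm St", "ff:1234"},
    "C": {"ff:1234"},
    "D": {"phone:555-0999"},
}
hops = chain_from(["A"], build_links(bookings))
```

Person D shares nothing with the others and never enters the chain. The hop count gives analysts a rough sense of how distant each association is from the original watch-list hit.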
[Illustration: “Cracking the Case.” A dirty bomb is set to go off in Times Square on New Year’s Eve. Plotters leave clues scattered around the world. Illustration by John MacNeill.]
We can never know what might have been. In the meantime, however, we must still contend with the problem of very sensitive data. The owners of certain data don’t want to let their information anywhere near a common data warehouse. In such cases, the notion of “federated” information suggests an approach that may be more applicable.
“It’s a way to have integration without centralization,” explains Nelson Mattos, director of information integration at IBM, which is forcefully developing this approach. The idea is to leave each data repository where it is (forget the warehouse) and instead provide it with a “wrapper”: a piece of software that knows how to translate the inner workings of that particular database to and from a standardized query language. This way, says Mattos, “a single query can be sent out to access all forms of information, wherever it’s stored.” The answers that come back are arrayed on the user’s screen in neat rows and columns, as if, he explains, they had been “all stored in a single database.”
This architecture solves the compatibility problem automatically, Mattos says. It also helps to ensure that sensitive information goes only to the right people. In a federated system, he says, “I continue to own my data. But I can write a wrapper that allows outsiders to access parts of my data, without giving away the whole thing.” The wrapper around a medical database might answer queries from public health officials about statistical information while denying access to information about specific patients. Likewise, the wrapper around an intelligence database might protect sources and methods. Such a system might be augmented by digital rights management, which could make it impossible to copy a digitally shared document.
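To make the wrapper idea concrete, here is a minimal sketch under stated assumptions. The class and function names, the request format, and the list of shareable fields are all hypothetical; IBM’s actual products are not modeled here. Each repository keeps its own records and answers only statistical queries, while a single federated query gathers whatever the owners agree to release.

```python
class MedicalWrapper:
    """Hypothetical wrapper around one medical database."""

    # Fields that may be summarized for outsiders (an assumed policy).
    SHAREABLE_FIELDS = {"diagnosis", "zip"}

    def __init__(self, rows):
        self._rows = rows  # the raw records never leave the owner

    def query(self, request):
        """Answer aggregate counts; refuse anything patient-specific."""
        allowed = (
            request.get("kind") == "count_by"
            and request.get("field") in self.SHAREABLE_FIELDS
        )
        if not allowed:
            raise PermissionError("only statistical queries are allowed")
        counts = {}
        for row in self._rows:
            key = row[request["field"]]
            counts[key] = counts.get(key, 0) + 1
        return counts


def federated_query(wrappers, request):
    """Send one request to every repository and merge the answers into rows."""
    rows = []
    for repository, wrapper in wrappers.items():
        try:
            answer = wrapper.query(request)
        except PermissionError:
            continue  # this owner declined; the others may still answer
        rows.extend((repository, key, count) for key, count in answer.items())
    return rows


wrappers = {
    "clinic_a": MedicalWrapper([
        {"name": "P1", "diagnosis": "flu", "zip": "10001"},
        {"name": "P2", "diagnosis": "anthrax", "zip": "10001"},
    ]),
    "clinic_b": MedicalWrapper([
        {"name": "P3", "diagnosis": "anthrax", "zip": "10002"},
    ]),
}
stats = federated_query(wrappers, {"kind": "count_by", "field": "diagnosis"})
names = federated_query(wrappers, {"kind": "count_by", "field": "name"})
```

The diagnosis query comes back as rows, as if from a single database. The query by patient name comes back empty because every owner refuses it.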
Start Making Sense
As powerful as they are, such tools as data mining help only to collect and refine the dots. Federated systems, for their part, help only to share these clues. And neither approach helps connect the dots, or piece together an understanding of what a cadre of terrorists or another criminal group is up to. This is perhaps the most difficult aspect of sensemaking. Working from fragmentary clues to develop an understanding of the how and why of a crime is what detectives do. It’s also what scientists do, as they consider experimental evidence and develop a hypothesis that explains a phenomenon. In many ways, sensemaking is an essentially human process, one that’s not going to be automated anytime soon, if ever. “You need good human judgment,” emphasizes In-Q-Tel’s Louie, “and an ability to draw sensible conclusions,” qualities that must be built on knowledge of history, religion, culture, and current events.
Still, technology can help. For example, consider how many clues (and indeed, how many of the insights needed to make sense of them) have their origins not in structured information, but in the vast realm of so-called unstructured data, which range from text files and e-mail to CNN feeds and Web pages. And “vast” is the word for it; when fully digitized, CNN’s video archives alone will require some four petabytes of storage, according to IBM, which is performing the conversion. That’s the equivalent of about a million PC disk drives. No human being can hope to read, view, or listen to more than the tiniest fraction of the world’s unstructured information. Even the best search engines are blunt tools. Do a Google search on “bonds,” for instance: you’ll get about five million hits on pages related to municipal bonds, chemical bonds, Barry Bonds, and a slew of other concepts.
What you really want is to have only the documents that interest you automatically find their way to you, says Ramana Venkata, founder and chief technical officer of Stratify, a Mountain View, CA, company funded by In-Q-Tel. Ideally, the documents would do more than just announce their existence. “They should put themselves into the context of other documents or of key historical trends,” Venkata asserts. And, of course, they should prioritize themselves, identifying which are most important, so that you don’t drown in information. For now, that’s still a pipe dream. But in the past few years, Stratify and a number of other companies have taken steps in that direction with their development of information classification software.
Even before September 11, Stratify’s Discovery System had been in use by the CIA and other intelligence agencies, Venkata says. To understand how it works, he suggests, imagine that somebody hands you a 30-gigabyte disk drive that is discovered in a cave in Afghanistan and asks you to figure out what’s on it. Or maybe you’ve downloaded a big collection of documents from a Web search. However you get them, he says, “the idea is to understand and organize the documents in terms of the topics and ideas they refer to, not just the specific words they contain.”
The first step, says Venkata, is to develop a taxonomy for the collection: a kind of card catalog that assigns each document to one or more categories. A taxonomy for information about aircraft, for example, might have subcategories for “helicopters” and “fixed wing,” with the latter subdivided into “fighters,” “bombers,” “transports,” and so on. Stratify’s software is set up to use or modify a standard taxonomy, says Venkata. Some organizations already have them. (Over the years, librarians have developed elaborate schemes for classifying information about biology, engineering, and countless other fields.)
But the software also can generate taxonomies automatically. Using proprietary algorithms, it scans through all the documents to extract their underlying concepts. On the basis of their conceptual similarity, it groups the documents into clusters, then links these clusters into larger clusters defined by broader concepts, continuing until every cluster is linked into a single taxonomy.
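Stratify’s algorithms are proprietary, but the general shape of the process (group similar documents, then group the groups until one tree remains) resembles textbook agglomerative clustering. The sketch below swaps in that standard technique and makes two loud assumptions: a document’s “concepts” are simply its set of words, and similarity is measured by word overlap (the Jaccard index). The document names and texts are invented.

```python
def jaccard(a, b):
    """Share of words two clusters have in common (assumed similarity measure)."""
    return len(a & b) / len(a | b)


def build_taxonomy(docs):
    """Repeatedly merge the two most similar clusters until one tree remains."""
    # Each cluster is (tree, words); a tree is a document name or a pair of subtrees.
    clusters = [(name, set(text.lower().split())) for name, text in docs.items()]
    while len(clusters) > 1:
        pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        i, j = max(pairs, key=lambda p: jaccard(clusters[p[0]][1], clusters[p[1]][1]))
        (tree_i, words_i), (tree_j, words_j) = clusters[i], clusters[j]
        merged = ((tree_i, tree_j), words_i | words_j)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


docs = {
    "heli1": "helicopter rotor transport",
    "heli2": "helicopter rotor attack",
    "jet1": "fighter jet wing",
    "jet2": "bomber jet wing",
}
taxonomy = build_taxonomy(docs)
```

With these four toy documents the two helicopter texts pair off, the two jet texts pair off, and the pairs join at the top, giving a small two-level tree. A production system would extract concepts far more carefully than by counting shared words.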
Furthermore, notes Venkata, the system assigns each document (and every new document that is added) to its appropriate place within the taxonomy. To ensure accuracy, Stratify’s software pools the results of multiple sorting procedures and algorithms, which range from manual sorting and supervised machine learning (“This document is a good example of that category; look for others like it”) to matching the statistical distribution of words. When appropriate, the system assigns documents to more than one category.
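How the results of several sorting procedures are pooled is not spelled out, so the following is only one plausible reading: a simple vote. Each procedure proposes a set of categories for a document, and a category is kept when at least half of the procedures agree. The threshold and the function name are assumptions.

```python
from collections import Counter


def pool(procedure_outputs, min_share=0.5):
    """Keep every category proposed by at least min_share of the procedures."""
    tally = Counter(cat for cats in procedure_outputs for cat in set(cats))
    total = len(procedure_outputs)
    return sorted(cat for cat, votes in tally.items() if votes / total >= min_share)


# Four hypothetical procedures classify the same document.
outputs = [{"fighters"}, {"fighters", "bombers"}, {"bombers"}, {"transports"}]
categories = pool(outputs)
```

Here the document lands in two categories at once, which matches the behavior described above of assigning documents to more than one category when appropriate.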
Of course, the user can edit the machine’s results at any time. But the machine-generated taxonomy provides an efficient and effective way to understand what a collection of documents is about. After all, Venkata says, “taxonomies are not an end to themselves. They’re tools to help you deal with huge amounts of information in a much more intuitive, natural way, to see patterns when you don’t know what you’re looking for.”
Seeing is Believing
Once a system or framework is in place, other technologies can help keep track of the multitude of clues, not to mention the enormous array of possible interpretations. A good example is Analyst’s Notebook, a set of visualization tools developed over the past decade by a Cambridge, England-based company called i2. Originally created for police and insurance investigators, Analyst’s Notebook has recently attracted a following among U.S. intelligence agencies. Indeed, the company claims that President Bush is regularly briefed with charts from i2.
Right after September 11, says Todd Drake, i2’s U.S. sales manager, “we were asked to come help out at FBI headquarters,” where a hurriedly convened interagency intelligence group was struggling to get up and running. “Our software was already heavily deployed inside both the FBI and the Defense Intelligence Agency. So we saw cases where a DIA guy would walk past the cube where an FBI guy was sitting and say, ‘Hey, I know this program!’ And suddenly they’d be working together. If the bureaucratic barriers fall, technology can help cooperation happen.”
Analyst’s Notebook provides timelines that illustrate related events unfolding over days, weeks, or even years, as well as transaction analysis charts that reveal patterns in, say, the flow of cash among bank accounts. But most dramatic of all are its link analysis charts. These look a bit like the route maps published in airline magazines, with crisscrossing lines connecting cities around the world. But the “cities” in these charts are symbols that represent people, organizations, bank accounts, and other points of interaction. (An i2 demonstration chart that traces only publicly available information shows the links that tie the September 11 hijackers to Osama bin Laden.)
Each element in a chart is hyperlinked to the evidence that supports it; that evidence, in turn, is tagged with such additional information as sources, estimated reliability, and security levels. If it later becomes apparent that information from a particular source is in fact disinformation (that is, lies), a single keystroke can purge from the chart all of that source’s contributions. It is similarly straightforward to eliminate elements that are related to classified information. The “sanitized” chart, still quite useful, may be freely shared with collaborators who lack the requisite security clearances. “Why,” Drake asks, “should the Drug Enforcement Administration have to redo something the Defense Intelligence Agency already did?”
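Purging and sanitizing become simple filters once every link carries its own tags. The sketch below is an illustration only; the `Link` record and its fields are hypothetical and say nothing about how Analyst’s Notebook stores its charts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """One connection in a link chart, tagged with its supporting source."""
    a: str
    b: str
    source: str
    classified: bool


def purge_source(chart, bad_source):
    """Drop every link contributed by a source now known to be unreliable."""
    return [link for link in chart if link.source != bad_source]


def sanitize(chart):
    """Drop every link that rests on classified evidence."""
    return [link for link in chart if not link.classified]


chart = [
    Link("Person X", "Account 17", source="informant-1", classified=False),
    Link("Person X", "Person Y", source="informant-2", classified=False),
    Link("Person Y", "Front Co.", source="intercept-9", classified=True),
]
trusted = purge_source(chart, "informant-2")
shareable = sanitize(chart)
```

Because the tags travel with each link, the same chart can yield a trusted view, a shareable view, or both, without anyone redrawing it by hand.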
Yet another way technology can enhance sensemaking is by providing tools that help steer analysts clear of certain mental pitfalls. That’s the goal of the Structured Evidential Argumentation System, which was developed by SRI International under Project Genoa, a recently completed DARPA program. (Genoa II, a follow-on project, is now under way at DARPA’s Information Awareness Office.) According to SRI team leader John Lowrance, the argumentation system “helps you organize your thinking and keeps you from jumping to conclusions too soon.” Too often, he says, when people are struggling to make sense of fragmentary clues, they succumb to the subconscious temptation to focus on one likely interpretation, neglecting to give other possibilities the attention they merit.
To help analysts avoid such blinkered thinking, SRI’s system uses “structured argumentation,” which marshals evidence according to a specified template. Depending on the task at hand, Lowrance says, the template might take the form of a flow chart, say, or a legal argument. The tool Lowrance’s team developed appears as a hierarchy of yes-or-no questions. Someone monitoring country X might start an analysis with a high-level question: Is country X headed for a political crisis? Answering that question requires the analyst to get answers to several more specific queries, such as, Is political instability increasing? Is there a government power struggle with potentially destabilizing consequences? Each of those questions, in turn, might find its answers in terms of still more specific queries: Is there evidence of growing factionalism?
This drilling-down process continues until it generates questions that can be answered with specific pieces of intelligence, such as field reports, Internet downloads, or the output of data-mining systems. On the basis of those results, the analyst refers to a five-color scale and selects the color that conveys the degree of the answer’s certainty: red, for “almost certainly yes”; orange, for “likely”; and so on, down to green, meaning “almost certainly no.” The system automatically aggregates these conclusions into color-coded assessments of the higher-level questions. And it provides a number of displays, including a color-coded overview of the entire hierarchy, that starkly outline the areas of greatest concern in red and orange.
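The aggregation rule SRI uses is not described, so the sketch below assumes the simplest cautious one: a question takes on the most alarming level found among its sub-questions. The scale is coded from 4 (red) down to 0 (green); the two intermediate levels are left unnamed because their colors are not given above. The question wording in the example is loosely based on the country X illustration.

```python
# Five-level scale; only the colors named in the text are filled in.
SCALE = {4: "red", 3: "orange", 2: "level 2", 1: "level 1", 0: "green"}


def assess(node):
    """Return the level for a question.

    A leaf is an integer chosen by the analyst; a branch is a dict of
    sub-questions. Taking the maximum is an assumed worst-case rule.
    """
    if isinstance(node, int):
        return node
    return max(assess(child) for child in node.values())


country_x = {
    "Is political instability increasing?": {
        "Is there evidence of growing factionalism?": 3,
        "Are field reports showing unrest?": 1,
    },
    "Is there a destabilizing power struggle?": 0,
}
overall = SCALE[assess(country_x)]
```

Under this rule a single “likely” answer deep in the hierarchy colors the top-level question orange. An averaging rule would be less alarmist and is just as easy to swap in.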
SRI’s tool gives “a very quick sense of what’s driving the conclusion,” says Lowrance. Perhaps its most important benefit, he says, is that because the tool communicates results specifically, rapidly, and graphically to people who are not familiar with a situation, it offers an effective alternative to writing memos. SRI is exploring the possibility of developing a commercial version.
Software companies have come up with a host of other analytical tools. In fact, with so many already available or under development, analysts are challenged to get everything to work together seamlessly. That’s why In-Q-Tel is funding Cincinnati-based Intelliseek, one of the few companies that specialize in such integration.
“There is no silver bullet,” says Intelliseek CEO Mahendra Vora. “And no one company has the complete solution to solve all our homeland security problems.” What’s needed, Vora says, is an open architecture that can incorporate new applications as they come along. For this reason, the company’s systems use open standards, such as extensible markup language, that make it easy for different pieces of sensemaking software to work in harmony. This open technical approach is, perhaps, a metaphor for the new age of homeland security itself.
For many people, mere technical protections against governmental infringement on privacy provide little reassurance. Witness last fall’s uproar in response to the news that John Poindexter, notorious for his involvement in the Iran-Contra affair, would head DARPA’s Information Awareness Office. (William Safire fulminated in the New York Times about “this master of deceit” and his “20-year dream” to snoop on every U.S. resident.) Many Americans have a visceral aversion to domestic-intelligence gathering in any form, notes Lee Tien, an attorney with the Electronic Frontier Foundation, and, he adds, they have good reason. Whenever he hears a bright new idea for high-tech intelligence, he says, “My question is, ‘What would J. Edgar Hoover do with such a system?’”
If anything can improve the level of trust, says Tien, it is even greater openness. “Skepticism grows from secrecy,” he says. “If officials are going to hide, not give any details, and just tell us they’re protecting privacy, that’s not a strategy that will reassure people.” Conversely, he says, by being as forthcoming as possible, the Department of Homeland Security and any other agency that coordinates the intelligence-gathering effort could reap big dividends in the form of public cooperation and political support.
We can only savor the irony. It is our openness as a society that makes us so vulnerable to terrorism. Yet our openness, on both the technological and the human levels, may very well be our strongest defense.
|A Sampling of Sensemakers|
|Entrieva||Reston, VA||Management of unstructured data such as text files, Web pages, and audio and|
|i2||Cambridge, England||Information visualization software|
|IBM||Armonk, NY||“Federated” management of structured data such as an airline reservations database, as well as unstructured data|
|In-Q-Tel||Arlington, VA||CIA’s venture fund for companies with technologies that show promise for intelligence work|
|Intelliseek||Cincinnati, OH||Integration of information search and analysis tools|
|Stratify||Mountain View, CA||Management of unstructured data|
|Systems Research and Development||Las Vegas, NV||Software that finds subtle patterns in complex webs of relationships and transactions|