Start Making Sense
As powerful as they are, such tools as data mining help only to collect and refine the dots. Federated systems, for their part, help only to share these clues. And neither approach helps connect the dots, or piece together an understanding of what a cadre of terrorists or another criminal group is up to. This is perhaps the most difficult aspect of sensemaking. Working from fragmentary clues to develop an understanding of the how and why of a crime is what detectives do. It’s also what scientists do, as they consider experimental evidence and develop a hypothesis that explains a phenomenon. In many ways, sensemaking is an essentially human process, one that’s not going to be automated anytime soon, if ever. “You need good human judgment,” emphasizes In-Q-Tel’s Louie, “and an ability to draw sensible conclusions,” qualities that must be built on knowledge of history, religion, culture, and current events.
Still, technology can help. For example, consider how many clues (and indeed, how many of the insights needed to make sense of them) have their origins not in structured information, but in the vast realm of so-called unstructured data, which range from text files and e-mail to CNN feeds and Web pages. And “vast” is the word for it; when fully digitized, CNN’s video archives alone will require some four petabytes of storage, according to IBM, which is performing the conversion. That’s the equivalent of about a million PC disk drives. No human being can hope to read, view, or listen to more than the tiniest fraction of the world’s unstructured information. Even the best search engines are blunt tools. Do a Google search on “bonds,” for instance: you’ll get about five million hits on pages related to municipal bonds, chemical bonds, Barry Bonds, and a slew of other concepts.
What you really want is to have only the documents that interest you automatically find their way to you, says Ramana Venkata, founder and chief technical officer of Stratify, a Mountain View, CA, company funded by In-Q-Tel. Ideally, the documents would do more than just announce their existence. “They should put themselves into the context of other documents or of key historical trends,” Venkata asserts. And, of course, they should prioritize themselves, identifying which are most important, so that you don’t drown in information. For now, that’s still a pipe dream. But in the past few years, Stratify and a number of other companies have taken steps in that direction with their development of information classification software.
Even before September 11, Stratify’s Discovery System had been in use by the CIA and other intelligence agencies, Venkata says. To understand how it works, he suggests, imagine that somebody hands you a 30-gigabyte disk drive that was discovered in a cave in Afghanistan and asks you to figure out what’s on it. Or maybe you’ve downloaded a big collection of documents from a Web search. However you get them, he says, “the idea is to understand and organize the documents in terms of the topics and ideas they refer to, not just the specific words they contain.”
The first step, says Venkata, is to develop a taxonomy for the collection: a kind of card catalog that assigns each document to one or more categories. A taxonomy for information about aircraft, for example, might have subcategories for “helicopters” and “fixed wing,” with the latter subdivided into “fighters,” “bombers,” “transports,” and so on. Stratify’s software is set up to use or modify a standard taxonomy, says Venkata. Some organizations already have them. (Over the years, librarians have developed elaborate schemes for classifying information about biology, engineering, and countless other fields.)
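For readers who think in code, the aircraft example can be pictured as a small tree. The sketch below is only an illustration, not a description of how Stratify stores its taxonomies: the nested-dictionary layout and the names `TAXONOMY` and `category_paths` are assumptions made for this example.

```python
# Hypothetical layout: each key is a category, each value holds its subcategories.
TAXONOMY = {
    "aircraft": {
        "helicopters": {},
        "fixed wing": {
            "fighters": {},
            "bombers": {},
            "transports": {},
        },
    },
}


def category_paths(tree, prefix=()):
    """Return every category as a tuple that spells out its path from the root."""
    paths = []
    for name, children in tree.items():
        path = prefix + (name,)
        paths.append(path)
        paths.extend(category_paths(children, path))
    return paths


if __name__ == "__main__":
    for path in category_paths(TAXONOMY):
        print(" > ".join(path))
```

A document about a bomber would then be filed under the path `("aircraft", "fixed wing", "bombers")`, and nothing prevents the same document from being filed under a second path as well.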
But the software also can generate taxonomies automatically. Using proprietary algorithms, it scans through all the documents to extract their underlying concepts. On the basis of their conceptual similarity, it groups the documents into clusters, then links these clusters into larger clusters defined by broader concepts, continuing until every cluster is linked into a single taxonomy.
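Stratify’s algorithms are proprietary, so the following is only a rough stand-in for the bottom-up grouping described above. It uses a textbook technique, agglomerative clustering over word-count vectors compared by cosine similarity, and assumes a non-empty collection. The function names and the sample documents are invented for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def vectorize(text):
    """Crude stand-in for concept extraction: count the words in the text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_taxonomy(docs):
    """Merge the two most similar clusters until one tree remains.

    docs maps a document name to its text. The result is a nested tuple
    whose leaves are document names.
    """
    clusters = [(name, vectorize(text)) for name, text in docs.items()]
    while len(clusters) > 1:
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: cosine(clusters[p[0]][1], clusters[p[1]][1]),
        )
        (tree_i, vec_i), (tree_j, vec_j) = clusters[i], clusters[j]
        merged = ((tree_i, tree_j), vec_i + vec_j)  # pool the word counts
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


if __name__ == "__main__":
    sample = {
        "a": "fighter jet wing",
        "b": "fighter jet engine",
        "c": "municipal bond yield",
        "d": "municipal bond market",
    }
    print(build_taxonomy(sample))
```

On the four sample documents, the two aviation texts pair off, the two finance texts pair off, and the two pairs are then joined under a single root, which is the shape of the process the article describes.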
Furthermore, notes Venkata, the system assigns each document (and every new document that is added) to its appropriate place within the taxonomy. To ensure accuracy, Stratify’s software pools the results of multiple sorting procedures and algorithms, which range from manual sorting and supervised machine learning (“This document is a good example of that category; look for others like it”) to matching the statistical distribution of words. When appropriate, the system assigns documents to more than one category.
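The article does not say how the results of the different sorting procedures are pooled. One simple possibility, shown here purely as an assumption, is a vote: each procedure proposes a set of categories, and the document keeps every category that enough procedures agree on. The name `pool_labels` and the `min_share` threshold are hypothetical.

```python
from collections import Counter


def pool_labels(votes, min_share=0.5):
    """Combine category proposals from several sorting procedures.

    votes is a list of sets, one per procedure. A category is kept when at
    least min_share of the procedures proposed it, so a document can end up
    in more than one category.
    """
    counts = Counter(label for labels in votes for label in labels)
    needed = min_share * len(votes)
    return sorted(label for label, n in counts.items() if n >= needed)


if __name__ == "__main__":
    proposals = [
        {"fighters"},             # e.g. a human sorter
        {"fighters", "bombers"},  # e.g. a supervised learner
        {"bombers"},              # e.g. a word-distribution matcher
        {"transports"},
    ]
    print(pool_labels(proposals))
```

With four procedures and the default threshold, a category needs two votes, so this document lands in both “bombers” and “fighters” while the lone “transports” proposal is dropped.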
Of course, the user can edit the machine’s results at any time. But the machine-generated taxonomy provides an efficient and effective way to understand what a collection of documents is about. After all, Venkata says, “taxonomies are not an end to themselves. They’re tools to help you deal with huge amounts of information in a much more intuitive, natural way, to see patterns when you don’t know what you’re looking for.”