Michael Lynch, the cofounder and chief executive of Autonomy, built Britain’s largest software company by solving a hard problem in computer science: how do you find something obscure within unstructured data? Unstructured data is information (whether text, audio, or video) that is not organized into the fields databases recognize, and it constitutes most of what has ever been digitally recorded. Autonomy’s technology (which is licensed by diverse organizations) supplements traditional search methods with pattern recognition techniques derived from Bayesian inference, an abstruse form of statistical analysis. He spoke with TR’s editor in chief, Jason Pontin.
TR: Why do I care about unstructured information?
Michael Lynch: Because we are human beings, and unstructured information is at the core of everything we do. Most business is done using this kind of human-friendly information. About 85 percent of the information inside a business is unstructured.
TR: Why is searching unstructured information such a problem?
ML: With structured information, you can simply ask, Does A equal B? With unstructured information, you have a more complex situation. You have the concept of ideas not matching but having nearness to each other: in one sense “dog” doesn’t equal “Labrador,” but in another sense it does. It’s a very difficult area for computers to understand.
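The contrast between "Does A equal B?" and nearness can be sketched in a few lines of Python. This is an illustrative toy only, not Autonomy's method: the context words and counts below are invented, and cosine similarity is just one common way to score how near two terms are.

```python
import math

# Hypothetical context-word counts, invented for illustration.
CONTEXTS = {
    "dog":      {"bark": 8, "leash": 5, "furry": 6, "vet": 4},
    "labrador": {"bark": 6, "leash": 4, "furry": 5, "retrieve": 3},
    "invoice":  {"payment": 9, "due": 6, "vendor": 5},
}

def cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as dicts."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def exact_match(a, b):
    """The structured-data question: does A equal B?"""
    return a == b

def nearness(a, b):
    """The unstructured-data question: how close are A and B in usage?"""
    return cosine(CONTEXTS[a], CONTEXTS[b])

print(exact_match("dog", "labrador"))         # the strings differ
print(round(nearness("dog", "labrador"), 2))  # high: many shared contexts
print(round(nearness("dog", "invoice"), 2))   # no shared contexts
```

With these made-up counts, "dog" and "labrador" fail the equality test yet score as near neighbors, while "dog" and "invoice" share nothing.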
TR: Why not use Boolean logic [which search engines use to interpret keyword queries]?
ML: In order to construct a Boolean query, you have to be quite skilled–and you have to know what you’re looking for. Let’s say that we wanted a computer to spot all articles about Apple. We could look for the ticker symbol. But there would be lots of articles about Apple without that. So we might have “Apple + computer” or “Mac + computer” and “not apple + tree” and “not apple + fruit,” etc. Pretty soon, we’d end up with a very complex expression. But the real problem is that as we create this kind of construction, the world changes–and suddenly we have to edit our complex expression and put “iPhone” in it.
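The brittleness Lynch describes can be shown with a small Python sketch. The helper names and sample headlines are hypothetical, and the rule is a simplified reading of the expression he outlines, not any real search engine's query syntax.

```python
import re

def mentions(text, *words):
    """True if every word in `words` appears as a token in `text`."""
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    return all(w in tokens for w in words)

def about_apple_v1(text):
    """Hand-built Boolean rule, roughly as described in the interview."""
    wanted = mentions(text, "apple", "computer") or mentions(text, "mac", "computer")
    unwanted = mentions(text, "apple", "tree") or mentions(text, "apple", "fruit")
    return wanted and not unwanted

def about_apple_v2(text):
    """The same rule after the world changes and someone edits in 'iPhone'."""
    return about_apple_v1(text) or mentions(text, "iphone")

# Hypothetical headlines, invented for illustration.
headlines = [
    "Apple unveils a new Mac computer",
    "The apple tree bore fruit early this year",
    "Crowds queue overnight for the iPhone",
]

for h in headlines:
    print(about_apple_v1(h), about_apple_v2(h), h)
```

The first rule misses the iPhone headline entirely. It only starts matching once a person notices the gap and rewrites the expression, which is the maintenance burden described above.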
TR: Why can Bayesian inference, with its foundation in probability, search unstructured data better?
ML: There have been two attempts to create systems that can learn how ideas relate to each other without those ideas having to be predefined. The first is very intuitive: it uses semantic methods. The computer understands the rules of grammar, and it sort of analyzes things. But there’s a fundamental problem. If I said to you, “The dog walked into the room and it was furry,” you can define the “it.” But you have some knowledge. You know that, statistically, dogs are more likely to be furry than rooms. So the people who work on these problems get into situations where they have PhDs sitting in back rooms defining that dogs have the property of furriness. And that starts to fall apart because the relationships between ideas are not absolute; they’re conditional. Now, the other approach–which we use–is counterintuitive: you treat the whole thing as a mathematical problem.
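The "furry" example can be worked as a small Bayesian calculation. The sketch below is a minimal illustration of Bayes' rule with invented probabilities. It is not Autonomy's algorithm, and the function and variable names are hypothetical.

```python
def posterior(priors, likelihoods):
    """Bayes' rule: P(h | e) is proportional to P(e | h) * P(h), then normalized."""
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(joint.values())
    return {h: p / total for h, p in joint.items()}

# Before reading the description, "it" could refer to either noun.
priors = {"dog": 0.5, "room": 0.5}

# Invented likelihoods: how often each referent is described this way.
p_furry_given = {"dog": 0.9, "room": 0.01}
p_has_door_given = {"dog": 0.001, "room": 0.95}

furry = posterior(priors, p_furry_given)
has_door = posterior(priors, p_has_door_given)

print({h: round(p, 3) for h, p in furry.items()})     # "it was furry" favors the dog
print({h: round(p, 3) for h, p in has_door.items()})  # "it had a door" favors the room
```

No rule states that dogs have the property of furriness. The referent of "it" falls out of conditional probabilities, and it shifts when the evidence changes, which is the sense in which relationships between ideas are conditional and not absolute.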