Michael Lynch, the cofounder and chief executive of Autonomy, built Britain’s largest software company by solving a hard problem in computer science: how do you find something obscure within unstructured data? Unstructured data is information (whether text, audio, or video) that is not organized into the fields that databases recognize, and it constitutes most of what has ever been digitally recorded. Autonomy’s technology, which is licensed by diverse organizations, supplements traditional search methods with pattern recognition techniques derived from Bayesian inference, an abstruse form of statistical analysis. Lynch spoke with TR’s editor in chief, Jason Pontin.
TR: Why do I care about unstructured information?
Michael Lynch: Because we are human beings, and unstructured information is at the core of everything we do. Most business is done using this kind of human-friendly information. About 85 percent of the information inside a business is unstructured.
TR: Why is searching unstructured information such a problem?
ML: With structured information, you can simply ask, Does A equal B? With unstructured information, you have a more complex situation. You have the concept of ideas not matching but having nearness to each other: in one sense “dog” doesn’t equal “Labrador,” but in another sense it does. It’s a very difficult area for computers to understand.
TR: Why not use Boolean logic [which search engines use to interpret keyword queries]?
ML: In order to construct a Boolean query, you have to be quite skilled, and you have to know what you’re looking for. Let’s say that we wanted a computer to spot all articles about Apple. We could look for the ticker symbol. But there would be lots of articles about Apple without that. So we might have “Apple + computer” or “Mac + computer” and “not apple + tree” and “not apple + fruit,” etc. Pretty soon, we’d end up with a very complex expression. But the real problem is that as we create this kind of construction, the world changes, and suddenly we have to edit our complex expression and put “iPhone” in it.
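[Editor's illustration: the hand-built rule Lynch describes might look roughly like the sketch below. This is not Autonomy's code or any search engine's query syntax. The function names, the sample headlines, and the use of the ticker "AAPL" are assumptions made for the example.]

```python
import re

def words_in(text):
    """Return the set of lowercase word tokens in a piece of text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def matches_apple_rule(text):
    """Hypothetical Boolean rule:
    ticker OR ((apple OR mac) AND computer AND NOT (apple AND (tree OR fruit)))."""
    w = words_in(text)
    has_ticker = "aapl" in w  # assumed ticker symbol for the example
    company_terms = ("apple" in w or "mac" in w) and "computer" in w
    fruit_terms = "apple" in w and ("tree" in w or "fruit" in w)
    return has_ticker or (company_terms and not fruit_terms)

# Invented sample headlines.
docs = [
    "Apple unveils a new Mac computer",
    "The apple tree in the orchard bore fruit",
    "Apple says the iPhone will ship in June",
]
results = [matches_apple_rule(d) for d in docs]
print(results)  # the iPhone headline is missed until someone edits the rule
```

[The third headline is about the company, yet the rule rejects it because nobody has added "iPhone" to the expression. That is the maintenance problem described above.]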
TR: Why can Bayesian inference, with its foundation in probability, search unstructured data better?
ML: There have been two attempts to create systems that can learn how ideas relate to each other without those ideas having to be predefined. The first is very intuitive: it uses semantic methods. The computer understands the rules of grammar, and it sort of analyzes things. But there’s a fundamental problem. If I said to you, “The dog walked into the room and it was furry,” you can define the “it.” But you have some knowledge. You know that, statistically, dogs are more likely to be furry than rooms. So the people who work on these problems get into situations where they have PhDs sitting in back rooms defining that dogs have the property of furriness. And that starts to fall apart because the relationships between ideas are not absolute; they’re conditional. Now, the other approach, which we use, is counterintuitive: you treat the whole thing as a mathematical problem.
TR: How so?
ML: Imagine that you took all the newspapers and books, and you cut out all the words, and put them in a black bag: you would have a random process. You would expect nothing but gobbledygook. But if we pick a real page of text, it’s not random: if we read the word “dog,” then the probability that you will see the word “walk” increases. The reason is that the process has been biased by something: the idea of the dog that was in the mind of the author of the sentence. By using Bayesian inference, you can, in fact, infer the existence of the idea behind the word and all its relationships. The wonderful thing is you inherently get context. With Bayesian systems, you understand that just because Nicole Kidman is a star doesn’t mean she’s a cosmic gas ball.
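[Editor's illustration: a minimal sketch of the two ideas in this answer, using a toy corpus invented for the example. The first part estimates how seeing "dog" raises the probability of seeing "walk". The second applies Bayes' rule, in the form of a simple naive Bayes classifier with add-one smoothing and equal priors, to guess which sense of "star" a phrase is about. This is a textbook technique chosen for illustration; it is not a description of Autonomy's actual system, and every name and sentence in it is hypothetical.]

```python
import math
from collections import Counter

# Toy corpus, invented for illustration only.
sentences = [
    "the dog went for a walk in the park",
    "she took the dog for a walk",
    "the dog barked at the postman",
    "the committee met to discuss the budget",
    "the budget vote was delayed",
    "he went for a walk after the vote",
]

def p_word(word, given=None):
    """Fraction of sentences containing `word`, optionally
    restricted to sentences that also contain `given`."""
    pool = [s.split() for s in sentences]
    if given is not None:
        pool = [s for s in pool if given in s]
    return sum(word in s for s in pool) / len(pool)

p_walk = p_word("walk")                   # 3 of 6 sentences
p_walk_given_dog = p_word("walk", "dog")  # 2 of the 3 "dog" sentences

# Tiny labelled examples for the two senses of "star" (also invented).
labelled = {
    "film": ["the star won an award for her film role",
             "the actress is a star of the new film"],
    "astronomy": ["the star is a ball of hot gas",
                  "the telescope observed a distant star and its gas"],
}

def posterior(text):
    """P(topic | words) by Bayes' rule, assuming equal priors,
    word independence, and add-one (Laplace) smoothing."""
    vocab = {w for docs in labelled.values() for d in docs for w in d.split()}
    scores = {}
    for topic, docs in labelled.items():
        counts = Counter(w for d in docs for w in d.split())
        total = sum(counts.values())
        log_like = sum(
            math.log((counts[w] + 1) / (total + len(vocab)))
            for w in text.split() if w in vocab
        )
        scores[topic] = math.log(0.5) + log_like
    norm = math.log(sum(math.exp(s) for s in scores.values()))
    return {t: math.exp(s - norm) for t, s in scores.items()}

film_ctx = posterior("the star of the film")
astro_ctx = posterior("the star is hot gas")
print(p_walk, p_walk_given_dog)
print(film_ctx, astro_ctx)
```

[In this toy corpus, "walk" appears in half of all sentences but in two-thirds of the sentences that mention "dog". The same word "star" is assigned to the film sense or the astronomy sense depending on the words around it, which is the context effect described above.]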
TR: Why can’t Google’s algorithms search unstructured information?
ML: Just because you’ve been very good at keyword-based, popularity-ranked search doesn’t actually buy you much advantage processing unstructured information where you have to understand meaning.
TR: You have philosophical as well as practical objections to the curatorial approach to search embraced by Wolfram Alpha (see “Search Me,” July/August 2009).
ML: Those methods can work very well in limited contexts. But there are some big philosophical problems with the idea that information is absolute in meaning and that you can classify it just one way. If you come from the probabilistic world, the first thing you learn is that you have to deal with people’s worldviews. A very simple example: a computer might classify the same news story differently if it was working for a Palestinian newspaper or an Israeli one. But there’s nothing wrong with that. This notion that all information should have the same meaning is something that we’ve been taught by the idea of objective science since the Reformation. But for lots of the tasks that people need to do, it’s perfectly acceptable that meaning should be in the eye of the beholder.