By one estimate, the total number of Web pages doubled to two billion unique indexable pages between January and June last year. By any estimate, finding relevant data in this sea of mostly unlabeled information is becoming truly difficult.
Enter the next round of data-mining tools: machine-learning algorithms for classifying and extracting information from Web pages. (“Machine learning” is the study of mathematical procedures that solve abstract problems and improve automatically through experience.)
Among those leading this charge is WhizBang Labs. Tom Mitchell, a professor at Carnegie Mellon University and WhizBang’s chief scientist, outlined its work at a meeting of the Society for Industrial and Applied Mathematics this month in Chicago.
The Origin of Search Tools
Early search tools began with keyword searching, letting you retrieve documents containing specific words or phrases. The problem is that keyword searches retrieve all documents containing the keyword or phrase but do not allow you to ask for the particular set of facts that represents your actual informational need.
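To make the limitation concrete, here is a minimal sketch of keyword search over an inverted index. The documents and the function names `build_index` and `keyword_search` are made up for illustration and do not describe any particular search engine.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercased word to the ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word.strip(".,?!")].add(doc_id)
    return index

def keyword_search(index, query):
    """Return the ids of documents that contain every word in the query."""
    word_sets = [index.get(word, set()) for word in query.lower().split()]
    return set.intersection(*word_sets) if word_sets else set()

# Hypothetical mini-collection
docs = {
    "a": "Jane Doe is vice president of sales.",
    "b": "The vice president of the club resigned.",
    "c": "Quarterly sales rose.",
}
index = build_index(docs)
hits = keyword_search(index, "vice president")
print(hits)
```

Both documents "a" and "b" come back, but nothing in the result says who the vice president is or how to reach them. The search returns pages, not facts.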
The next evolution was document classification: categorizing and cataloging documents based on whole-document topics.
The current stage in search-tool evolution is “entity extraction”: automatically extracting specific types of entities, such as dates, cities, countries, and person or company names, and storing collections of interrelated entities in database records.
“Once we have the ability to extract databases of this kind,” says Mitchell, “the user can search the database with queries such as ‘What is the name and phone number of each vice president mentioned in the document?’
“This kind of record extraction is where we are driving the evolution of tools for managing the information flood.”
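As a rough illustration of what such a record looks like, the sketch below pulls name, title and phone-number records out of free text. It uses hand-written regular expressions purely for brevity. The systems Mitchell describes learn their extractors from examples, and the patterns, sample text and `extract_records` function here are assumptions, not WhizBang's method.

```python
import re

# Hand-written patterns, for illustration only
NAME = r"(?P<name>[A-Z][a-z]+(?: [A-Z][a-z]+)+)"
TITLE = r"(?P<title>[Vv]ice [Pp]resident(?: of [A-Z][A-Za-z]+)?)"
PHONE = r"(?P<phone>\d{3}-\d{3}-\d{4})"

# A name, then a title, then the first phone number in the same sentence
RECORD = re.compile(NAME + r", " + TITLE + r"[^.]*?" + PHONE)

def extract_records(text):
    """Return one dict per matched (name, title, phone) record."""
    return [match.groupdict() for match in RECORD.finditer(text)]

# Made-up sample text with fictional phone numbers
text = ("Jane Doe, vice president of Sales, can be reached at 412-555-0101. "
        "The front desk is at 412-555-0100. "
        "John Smith, Vice President of Engineering, is at 412-555-0102.")

records = extract_records(text)
for record in records:
    print(record["name"], "|", record["title"], "|", record["phone"])
```

Once the text has been turned into records like these, the vice-president query in the quote becomes an ordinary database lookup. The front-desk number is skipped because it belongs to no name and title.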
Mitchell uses three types of algorithms to extract new database records from existing Web pages. The Naive Bayes model, described in Mitchell’s 1997 textbook Machine Learning, is a basic algorithm for calculating the probability that documents will contain targeted information based on topic-word frequencies.
Naive Bayes classifiers are among the more successful algorithms for classifying text documents. However, if keywords are not properly identified for the topic, Bayesian classifiers have difficulty producing correct results.
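A minimal sketch of a Naive Bayes text classifier may help here. It is the standard multinomial form with add-one smoothing. The training sentences and the labels "jobs" and "courses" are invented for the example.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """labeled_docs: list of (label, text). Returns log priors, log likelihoods, vocabulary."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, text in labeled_docs:
        words = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    total_docs = sum(class_counts.values())
    log_prior = {c: math.log(n / total_docs) for c, n in class_counts.items()}
    log_like = {}
    for c in class_counts:
        # Add-one smoothing keeps unseen words from zeroing out a class
        denom = sum(word_counts[c].values()) + len(vocab)
        log_like[c] = {w: math.log((word_counts[c][w] + 1) / denom) for w in vocab}
    return log_prior, log_like, vocab

def classify(text, model):
    """Pick the class with the highest posterior, treating words as independent."""
    log_prior, log_like, vocab = model
    words = [w for w in text.lower().split() if w in vocab]
    scores = {c: log_prior[c] + sum(log_like[c][w] for w in words) for c in log_prior}
    return max(scores, key=scores.get)

# Hypothetical training examples
training = [
    ("jobs", "software engineer position open apply now"),
    ("jobs", "hiring sales manager full time position"),
    ("courses", "distance education course enroll online"),
    ("courses", "continuing education evening course schedule"),
]
model = train_naive_bayes(training)
print(classify("apply for engineer position", model))
print(classify("online education course", model))
```

The word "for" in the first query never appeared in training, so it is ignored. This shows in miniature the weakness noted above: a document whose telling words are missing from the model gives the classifier nothing to work with.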
So-called “maximum entropy” algorithms improve on Bayesian models: they can look beyond the assumption of word independence and examine how often combinations of words occur together. For example, while the words “data” and “mining” need not appear together as a phrase, there is a high likelihood that they will both be found in certain types of documents.
A limitation of maximum-entropy algorithms is that they need to be “trained” or “supervised” with initial sets of positive and negative examples.
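The sketch below shows one way this can look in practice. For two classes, a maximum-entropy classifier has the same form as logistic regression, so this example trains one by plain gradient ascent on a handful of positive and negative examples. Representing "word combinations" as features for pairs of words that co-occur in a document is an assumption made for illustration, as are the training sentences.

```python
import math
from itertools import combinations

def features(text):
    """Single words plus every pair of distinct words that co-occur in the text."""
    words = sorted(set(text.lower().split()))
    feats = set(words)
    feats.update(combinations(words, 2))
    return feats

def probability(feats, weights):
    """P(relevant | features) under the log-linear model."""
    score = sum(weights.get(f, 0.0) for f in feats)
    return 1.0 / (1.0 + math.exp(-score))

def train_maxent(labeled_docs, steps=200, rate=0.05):
    """Fit feature weights by batch gradient ascent on the log-likelihood."""
    data = [(features(text), label) for text, label in labeled_docs]
    weights = {}
    for _ in range(steps):
        gradient = {}
        for feats, label in data:
            error = label - probability(feats, weights)
            for f in feats:
                gradient[f] = gradient.get(f, 0.0) + error
        for f, g in gradient.items():
            weights[f] = weights.get(f, 0.0) + rate * g
    return weights

# Hypothetical supervised examples: 1 = about data mining, 0 = not
training = [
    ("data mining tools for web pages", 1),
    ("mining web data with algorithms", 1),
    ("coal mining company report", 0),
    ("census data report released", 0),
]
weights = train_maxent(training)
print(round(probability(features("new algorithms for mining data"), weights), 2))
print(round(probability(features("coal company report"), weights), 2))
```

Because "data" and "mining" each also appear in a negative example, neither word alone is strong evidence. The pair feature `("data", "mining")` occurs only in positive examples, so it earns a positive weight whether or not the two words are adjacent.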
New “co-training” models that tap additional sources of information about a Web page are the third technique, “the thing I’m most excited about,” says Mitchell.
These proprietary algorithms can learn from “unlabeled” data sets with very limited training by programmers. They do this by analyzing the hyperlinks that refer to a Web page and correlating the information contained in those links with the text on the page.
“The features that describe a page are the words on the page and the links that point to that page,” says Mitchell. “The co-training models utilize both classifiers to determine the likelihood that a page will contain data relevant to the search criteria.”
In a progressive search, the algorithm remembers link information as an implied correlation, and as the possible link correlations grow, they can help to confirm a hit or a miss. The search works both ways, so that text information on the page can help determine the relevance of link classifiers, hence the term “co-training.”
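The two-way exchange can be sketched as a small loop. This is a generic illustration of the co-training idea, not WhizBang's proprietary algorithm. Each page has two views, its own words (`"text"`) and the words of links pointing to it (`"links"`), each view gets a simple Naive Bayes model, and the two models take turns labeling the unlabeled page they are most confident about. The pages, key names and label meanings (1 = course page, 0 = other) are all hypothetical.

```python
import math
from collections import Counter

def train_view(labeled, view):
    """Count words per label using only one view of each page."""
    counts = {0: Counter(), 1: Counter()}
    for page, label in labeled:
        counts[label].update(page[view].split())
    return counts

def positive_probability(counts, text):
    """P(label = 1) from one view, Naive Bayes with add-one smoothing and equal priors."""
    vocab = set(counts[0]) | set(counts[1])
    totals = {c: sum(counts[c].values()) + len(vocab) for c in (0, 1)}
    log_odds = 0.0
    for word in text.split():
        if word in vocab:  # words never seen in training carry no evidence
            log_odds += math.log((counts[1][word] + 1) / totals[1])
            log_odds -= math.log((counts[0][word] + 1) / totals[0])
    return 1.0 / (1.0 + math.exp(-log_odds))

def co_train(labeled, unlabeled, rounds=3):
    """Alternate views; each labels its most confident page for the shared pool."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(rounds):
        for view in ("text", "links"):
            if not unlabeled:
                break
            counts = train_view(labeled, view)
            probs = [positive_probability(counts, page[view]) for page in unlabeled]
            best = max(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))
            if probs[best] == 0.5:
                continue  # this view has no evidence yet; let the other view try
            labeled.append((unlabeled.pop(best), 1 if probs[best] > 0.5 else 0))
    return labeled

seed = [
    ({"text": "course syllabus homework lecture", "links": "cs101 course page"}, 1),
    ({"text": "company products pricing contact", "links": "buy our products"}, 0),
]
pool = [
    {"text": "lecture notes homework exam", "links": "my teaching"},
    {"text": "office hours readings", "links": "teaching spring term"},
    {"text": "pricing catalog order", "links": "online store"},
    {"text": "shipping returns catalog", "links": "store deals"},
]
result = co_train(seed, pool)
for page, label in result[len(seed):]:
    print(label, page["text"])
```

The second page in the pool shares no words with the seed pages, so the text model alone cannot place it. It gets labeled only because the text model first labels the "lecture notes" page, which teaches the link model that the word "teaching" in a link signals a course page.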
Co-training models cut error rates by more than half, Mitchell claims. Other algorithms have a hit accuracy of 86 percent, while co-trained models attain 96 percent, he says, which would lower the error rate from 14 percent to 4 percent.
WhizBang’s online job site, FlipDog.com, launched last year as a demonstration of its data-mining technology. Since then it has signed up clients such as Dun & Bradstreet and the U.S. Department of Labor, for which it is compiling a directory of continuing and distance education opportunities.
Adding Natural Language
The next step, says Mitchell, will be to combine advanced searching techniques with natural-language querying models like those used at AskJeeves.com.
“AskJeeves focuses on parsing the user’s language,” says Mitchell. “Their vision is the right vision, but the really hard part is on the searching end.”
When natural-language requests are successfully combined with new data-mining algorithms, he says, then “you’ll have a really powerful generation of search tools.”