Technology Review - Published By MIT

Building a Better Search Engine

A new natural-language system is based on 30 years of research at PARC.

By Michael Reisman

Friday, July 27, 2007

Gravitating toward an answer: Powerset’s natural-language search engine takes context and meaning into consideration while scouring the Web for information. As a result, better answers are funneled to the user.
Credit: Technology Review

Powerset, Inc., based in San Francisco, is on the verge of offering an innovative natural-language search engine, based on linguistic research at the Palo Alto Research Center (PARC). The engine does more than merely accept queries asked in the form of a question. The company claims that the engine finds the best answer by considering the meaning and context of the question and related Web pages.
"Powerset extracts deep concepts and relationships from the texts and the user's query, and matches them efficiently to deliver a better search," Powerset CEO Barney Pell says.

Even though attempts have been made at natural-language search for decades, Powerset says that its system is different because it has solved some of the fundamental technological problems that have existed with this kind of search. It has done so by developing a product that is deep, computationally advanced, and still economically viable.

Pell says that it's difficult to pinpoint one particular technological breakthrough, but he believes that Powerset's superiority lies in the three decades of hard work by scientists at PARC. (PARC licensed much of its natural-language search technology to Powerset in February.) There was not one piece of technology that solved the problem, Pell says, but instead, it was the unification of many theories and fragments that pulled the project together.

"After 30 years, it's finally reached a point where it can be brought into the world," he says.

A key component of the search engine is a deep natural-language processing system that extracts the relationships between words; the system was developed from PARC's Xerox Linguistic Environment (XLE) platform. The framework that this platform is based on, called Lexical Functional Grammar, enabled the team to write different grammar engines that help the search engine understand text. This includes a robust, broad-coverage grammar engine written by PARC. Pell also claims that the engine is better than others at dealing with ambiguity and determining the real meaning of a question or a sentence on a Web page. All these innovations make the system more adaptable, he says, so that it can extract deep relationships from text.

Powerset chief technology officer Ron Kaplan has led PARC's XLE team since the 1970s and is the author of much of the technology behind XLE that has been licensed to the company. Kaplan says that he and Pell began to collaborate on the idea about two years ago.
Current methods of searching used by more traditional engines focus on isolated keywords and broad but shallow content coverage. This leaves a lot of room for improvement, Kaplan says.

"They are really not getting at relationships," he notes. "The best that they do to approximate relationships are words that are close to other words." He adds that a much deeper level of analysis is required.
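Kaplan's distinction can be made concrete with a toy sketch. The code below is not Powerset's or PARC's method; it only contrasts a keyword-proximity match with a match on explicit (subject, relation, object) triples. The company names, the documents, the hand-written triples, and every function name are hypothetical and exist only for illustration.

```python
import re

# Toy corpus with invented company names. The "relations" sets are
# hand-written stand-ins for what a linguistic parser might extract;
# they are an assumption of this sketch, not output from any real system.
DOCS = {
    "doc_a": {
        "text": "Acme acquired Initech last year.",
        "relations": {("acme", "acquire", "initech")},
    },
    "doc_b": {
        "text": "Initech acquired Acme last year.",
        "relations": {("initech", "acquire", "acme")},
    },
}


def tokens(text):
    """Split text into lowercase alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def proximity_match(query_words, text, window=5):
    """True if all query words occur within some run of `window` tokens."""
    toks = tokens(text)
    wanted = set(query_words)
    for start in range(len(toks)):
        if wanted <= set(toks[start:start + window]):
            return True
    return False


def keyword_search(query_words):
    """Return ids of documents whose text passes the proximity test."""
    return sorted(
        doc_id for doc_id, doc in DOCS.items()
        if proximity_match(query_words, doc["text"])
    )


def relation_search(triple):
    """Return ids of documents that contain the exact relation triple."""
    return sorted(
        doc_id for doc_id, doc in DOCS.items()
        if triple in doc["relations"]
    )


if __name__ == "__main__":
    print(keyword_search(["acme", "acquired", "initech"]))
    print(relation_search(("acme", "acquire", "initech")))
```

In this sketch the keyword search returns both documents, because the same three words sit close together in each, while the relation search returns only the document in which Acme is the buyer. Producing such triples reliably from real text is the hard part, and the sketch does not attempt it.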

Previous attempts have tried to pair some natural-language query processing with standard keyword searches of relevant content. This approach can be seen in some parts of standard search engines like Google, which, if it doesn't understand a user's query, will suggest another phrase or word that it thinks he or she may have meant. Engines such as Google and Yahoo use some components of natural-language search, but there has not yet been a full-scale natural-language search engine for consumers. (See "The Future of Search.") Pell says that this was mainly because the necessary technology was simply not ready. Engines that use natural-language components for aspects of the search, such as iPhrase and EasyAsk, don't process textual content as Powerset does, Pell says, but instead simply query databases for answers to questions. Attempts at full natural-language search, such as those offered by Hakia and Cognition Search, do not cover as rich a representation of concepts or meaning, Pell says.
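The query-suggestion behavior described above can be illustrated with a minimal sketch. It is a toy stand-in built on the Python standard library's difflib module, not a description of how Google or any other engine generates suggestions. The vocabulary, the similarity cutoff, and the function name are assumptions chosen for the example.

```python
import difflib

# Hypothetical vocabulary of known index terms (an assumption of this sketch).
VOCABULARY = ["natural", "language", "search", "engine", "grammar"]


def suggest(query, cutoff=0.8):
    """Return a corrected query, or None if no word needed correcting.

    Each unknown word is replaced by its closest vocabulary term when the
    similarity ratio reported by difflib is at least `cutoff`.
    """
    corrected = []
    changed = False
    for word in query.lower().split():
        if word in VOCABULARY:
            corrected.append(word)
            continue
        close = difflib.get_close_matches(word, VOCABULARY, n=1, cutoff=cutoff)
        if close:
            corrected.append(close[0])
            changed = True
        else:
            # Nothing similar enough: keep the user's word unchanged.
            corrected.append(word)
    return " ".join(corrected) if changed else None


if __name__ == "__main__":
    print(suggest("natural langauge serch"))
    print(suggest("natural language search"))
```

A suggestion of this kind works purely on the spelling of individual words. It says nothing about what the query means, which is the gap the article's sources say a full natural-language engine is meant to close.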

Comments

  • Old stuff
    What these people are about to create has been on the market for many years now. One of the companies that provides that technology is delphes.com.

    fischia
    07/27/2007
    Posts:1
    Avg Rating:
    3/5
    • Re: Old stuff
      I have recently submitted a provisional patent entitled "A method and system for improving the relevance of the results of queries of language based searchable databases." This patent makes it possible for the creator of information and the searcher of information to define exactly the meaning of words, solving the problem of homonyms. For example, if either uses the word "plane," it is possible for all concerned to know specifically whether the reference is to a fixed-wing aircraft, a wood-carving tool, etc.

      The provisional patent can be viewed by going to:

      http://chuningd.googlepages.com

      I thought this might be of interest to you.  I welcome any feedback.

      chuningd@hotmail.com

      Sincerely,

      Dennis Chuning
      P O Box 2651
      Napa, CA  94558

      CHUNINGD
      07/28/2007
      Posts:1
  • Remember the "Dewey Decimal System"?
    In an industry survey on organizing the web about a dozen years ago, I asked why not use something like the DDS for organizing web pages. The industry would create the basic format and the user would enter the appropriate information. But, no, keywords were chosen instead, leaving us crafty folks to use all our competitors' names in the keywords to get their customers, until it was declared illegal.

    fiberman
    07/27/2007
    Posts:80
    Avg Rating:
    3/5
  • My take
    I thought about the search engine thing, and building a natural language for the computer was probably the best take.

    Yet what I am reading about is not intriguingly new. It is, as a matter of fact, old-fashioned.

    I mean, not trying to copy human behaviour is of course the most natural thing to do. The internet itself is much smarter at organizing its own content than the human brain is...

    It is like those folks messing around with robot brains, making them learn how to act and feel like humans, while the answer is just that you have to organize some kind of system that is able to analyse its own code and results in a memory stack, and that designs its own, more reflexive feedback loops.
    The key to robot intelligence lies not in one program and one robot being super intelligent, but in the experiences gained by a first generation of robots being transferable to the next generation, and in the next generation knowing how to translate the experiences gained by older generations into its own system. That basically means that one must operate with different levels of code, as we know it from programming languages, from assembler to such a thing as PHP..

    Whatever, I want to take a shot at the topic, just because I have nothing to do and I'm bored.

    Search engines. It is not a really difficult problem to re-tag/re-index the internet. Traditional literature propagates the model that one takes links and meta-tags mentioned in code and content, drops them into statistical grids, and uses data mining to make it smarter..

    In fact the first intelligent thing to do is to map the net. Meaning: every action, every piece of content that is somehow related to another is to be given a specific point in a hyper-sphere.

    If I have a web page, I have links on that page, I have dynamic content, I have specific coordinates in the window, and most important: I have a browser logging the human-computer interaction.

    Allowing, for example (just for fun), the user to apply his own smartness to tag the internet through a browser-based protocol that stores information given on a specific page and coordinate (a click and a tag that addresses the text or a specific word in the text) would make a lot of sense. If you organize this behaviour and make each user's work available through a social-network kind of program that filters good content from bad using status information and ratings, you could make the dumb user re-contextualize the whole internet, tagging content in the browser wherever he wants to...
    And of course, using automated scripts, the user could generate his own pages from any kind of accessible information located somewhere on the net, to display it in another window.. on his "hyperpage"..

    But, this is boring, right? This does not address the search problem, because it actually changes the very way the internet is hyperlinked rather than making the old net accessible..

    So.. using contexts.. even without browser coordinates we can get a pretty perfect grid of information relations in texts. The former inventor of the web, y'all might know him.. actually solved a kind of problem by adding a code to each paragraph of written text to tag its origin and to make it entirely genuine. This would solve the problem, too. But we don't even have that technology.. so what do we do?
    We make a dumb net more intelligibly searchable..

    Still, we have links, we have meta-information in the page, and we can even use fast algorithms to get a grip on the gist of a specific piece of content, and what's more, we even have RSS and XML coming that actually work kind of like the former inventor wanted.. great..

    So we have a functional sphere in which every point is defined by a specific set of connections to its link context: its functional place in the net.

    What do we want, however? We want context stuff.
    We can do the very painful job of actually analyzing user behaviour and finding out when someone actually visits the page: in what kind of mood, where he was referred from, how long he stays.. the protocols already tell us a lot, don't they..

    It would be smart to write code into the browser that makes the browser do the analytical work. Just like the specimen search project. Because it is a pain in the ass to analyze user behaviour implicitly by looking at the logs.

    So, by storing the information of each "point" in the system in a grid and adding content- and context-related information derived by statistical analysis of user behavior (right, we don't need to watch everyone), we would be able to lay several other layers of grids over the actual grid...

    We would then have this information: the specific location of the information; the specificity of the actual content-linkage structure (useless rubbish is not as complexly linked within, let's say, a 5-click horizon as important information is.. while top-secret information links out not at all or only sparsely); the place of the content based on its functional and contextual place in a representative user's everyday work;
    and, well...

    Okay, let's get even more into this.. we can now not only use the representative user in order to induce structures into our hyper-search-engine..
    we can even create specific representative users. Bankers love to read articles about rockstars eating rats, and that is probably why they jump from a site with hot porn content on...
    well, most obviously, using representative schemes of access behaviour and context usage for specific user groups even allows us to eliminate those sites that have no content on them but "money, stocks, moneystocks, stocksmoney, stockmoneys".. you know these sites, and Google somehow can't handle them...

    And we actually separate the web into "spheres of specific normal usage".

    Of course.. all mining and better search rests on the power of mapping information. Throwing information and functional relationships onto a Google Earth map can help a lot, and the actual information Google, or we as search engineers, want to know is.. how do people react to advertisements on TV and on the net, who is watching, where is he watching, etc..

    Knowing this, we have to think about synergies.. we have representative users and the specific spheres in which they move through the internet; we know a) what context they usually search in (at a specific time of the day, from a specific location, from a specific company... etc.); we also have clear indicators for abnormal behaviour (the banker surfs through the erotic worlds..) and find ways to re-contextualize it.. (we use, of course, the browser protocol and the user's CPU to calculate his context):..

    and so on..
    You might wonder.. what does this have to do with language analysis....

    You figured it.. NOTHING... Because language is not at all important.. and it is even less important to have a computer understand a question..
    If I type 20 words into an engine and the engine knows exactly what I want, because that very combination of words clearly identifies the context, the function of the search, the quality of content wanted (popular, alternative, "underground", "sensitive"), etc.. then we have a good search engine.

    Fact is, the user of a computer does not think in grammar.... and grammar is just some useless thing to talk about.

    BenShane
    08/05/2007
    Posts:2
    Avg Rating:
    5/5

MIT Massachusetts Institute of Technology © 2010 Technology Review. All Rights Reserved.