Technology Review - Published By MIT

Friday, July 27, 2007

Building a Better Search Engine

Continued from page 1

By Michael Reisman


The company plans to release demo versions of the search engine on its Powerlabs website, where consumers can test-drive the product beginning in September. User feedback will be taken into consideration as Powerset makes the final product, which is slated for release next year.

"The key challenge is to get the system to the point where people can understand how to use it and get real value out of these systems even though they are not perfect," Pell says. "We are finally at the point where we are going to cross that threshold."

IBM is also developing a semantic search engine, code-named Avatar, which is targeted at enterprise and corporate customers; it's currently in beta testing within IBM. Project manager Shivakumar Vaithyanathan says that the hardest problem to overcome with natural-language search is finding a way to extract higher-level semantics from large documents while preserving precision and speed.

IBM's engine is targeted toward searches of internal documents such as e-mail and intranet correspondence. It's designed to be used in cases in which the user seeks to find one particular piece of information that could not be easily located, such as a specific phone number or package-tracking URL that's located in one of thousands of e-mails that a person may have stored on her computer.

Avatar's semantic search seeks to develop "interpretations" of keyword queries that model the real intent behind the query. For example, if the query were "phone number," the search engine would search the thousands of e-mails that a person may receive for strings of digits that resemble a phone number. The search engine would provide the user with the information he seeks, and not just an e-mail that happens to contain the words mentioned in the query.
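
To make the distinction concrete, here is a minimal Python sketch that contrasts plain keyword matching with one possible "interpretation" of the query "phone number." It is not IBM's implementation: the regular expression, the function names, and the sample e-mails are all assumptions made for illustration, and the pattern covers only a couple of North American number formats.

```python
import re

# Assumed pattern (not from the article): matches 555-123-4567, 555.123.4567,
# 555 123 4567, and (555) 123-4567, but not an unbroken run of ten digits.
PHONE_RE = re.compile(r"(?<!\d)(?:\(\d{3}\)\s*|\d{3}[-.\s])\d{3}[-.\s]\d{4}(?!\d)")


def keyword_search(emails, query):
    """Return indexes of e-mails whose text literally contains the query."""
    needle = query.lower()
    return [i for i, body in enumerate(emails) if needle in body.lower()]


def find_phone_numbers(emails):
    """Return (email_index, matched_text) pairs for each phone-like string."""
    hits = []
    for i, body in enumerate(emails):
        for match in PHONE_RE.finditer(body):
            hits.append((i, match.group(0)))
    return hits


# Hypothetical mailbox used only for this example.
emails = [
    "Please see the phone number policy attached.",
    "Call me at 555-123-4567 after lunch.",
    "Front desk: (555) 987-6543. Invoice 1234567890 is paid.",
]

keyword_hits = keyword_search(emails, "phone number")
interpreted_hits = find_phone_numbers(emails)
```

In this sketch the keyword search finds only the first e-mail, which mentions the words but contains no number, while the interpreted search returns the two e-mails that actually hold phone-like strings and skips the unbroken invoice number.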

In order to quickly extract all the meaningful information from both the underlying text and the query, Vaithyanathan says, it's necessary to utilize either a lot of computers or a large number of people. Both options are expensive and can be difficult to implement. IBM hopes to find a way to extract meaning in less time and with fewer machines.

"If we do a better job of extracting, then we can do a better job of answering the questions that users give," Vaithyanathan says.

Comments

  • Old stuff
    fischia on 07/27/2007 at 6:33 AM
    Posts: 1, Avg Rating: 3/5
    What these people are about to create has been on the market for many years now. One of the companies that provides that technology is delphes.com
    • Re: Old stuff
      CHUNINGD on 07/28/2007 at 7:12 PM
      Posts: 1
      I have recently submitted a provisional patent entitled "A method and system for improving the relevance of the results of queries of language based searchable databases."  This patent makes it possible for the creator of information and the searcher of information to define exactly the meaning of words, solving the problem of homonyms.  For example, if either uses the word "plane," it is possible for all concerned to know specifically whether the reference is to a fixed-wing aircraft, a wood-carving tool, etc.

      The provisional patent can be viewed by going to:

      http://chuningd.googlepages.com

      I thought this might be of interest to you.  I welcome any feedback.

      chuningd@hotmail.com

      Sincerely,

      Dennis Chuning
      P O Box 2651
      Napa, CA  94558
  • Remember the "Dewey Decimal System"?
    fiberman on 07/27/2007 at 3:46 PM
    Posts: 46, Avg Rating: 3/5
    In an industry survey on organizing the web about a dozen years ago, I asked why not use something like the DDS for organizing web pages. The industry would create the basic format and the user would enter the appropriate information. But, no, keywords were chosen instead, leaving us crafty folks to use all our competitors' names in the keywords to get their customers, until it was declared illegal.
  • My take
    BenShane on 08/05/2007 at 5:26 PM
    Posts: 2, Avg Rating: 5/5
    I thought about the engine thing, and building a natural language for the computer was probably the best take.

    Yet what I am reading about is not intriguingly new. It is, as a matter of fact, old-fashioned.

    I mean, not trying to copy human behaviour is of course the most natural thing to do. The internet itself is much smarter in organizing its own content than is the human brain...

    It is like those folks messing around with robot brains that make them learn how to act and feel like humans, while the answer is just that you have to organize any kind of system that is able to analyse its own code and results in a memory stack, and that designs its own, more reflexive feedback loops.
    The key to robot intelligence lies not in the ability of one code and robot to be super intelligent, but in the experiences gained by a first generation of robots being transferable to the next generation, and in the next generation knowing how to translate the experiences gained by older generations into its own system, which basically means that one must operate with different levels of code, like we know it from programming languages from assembler to such a thing as PHP..

    Whatever, I want to take a shot at the topic, just because I have nothing to do and because I'm bored.

    Search engines: it is not a really difficult problem to re-tag/re-index the internet. Traditional literature propagates the model that one takes links and meta-tags mentioned in code and content, drops them into statistical grids, and uses data mining to make it smarter..

    In fact the first intelligent thing to do is to map the net. Meaning: every action, every piece of content that is somehow related to another is to be given a specific point in a hyper-sphere.

    If I have a web page, I have links on that page, I have dynamic content, I have specific coordinates in the window, and most important: I have a browser logging the human-computer interaction.

    Allowing, for example, just for fun, the user to use his own smartness to tag the internet through a browser-based protocol that stores information given on a specific page and coordinate (a click and a tag that addresses the text or the specific word in the text) would make a lot of sense. If you organize this behaviour and make each user's work available through a social-network kind of program that filters good content from bad content using status information and ratings, you could make the dumb user re-contextualize the whole internet, tagging the content in the browser wherever he wants to...
    And of course, using automated scripts, the user could generate his own pages, using any kind of accessible information located somewhere in the net to display it in another window..on his "hyperpage"..

    But, this is boring, right? This does not address the search problem, because it actually changes the very way the internet is hyperlinked rather than making the old net accessible..

    So..using contexts..even without browser coordinates we can get a pretty perfect grid of information relations in texts. The former inventor of the web, y'all might know him.. actually solved a kind of a problem by adding a code to each paragraph of text written, to tag its origin and to make it entirely genuine. This would solve the problem, too. But we don't even have that technology ..so what do we do?
    We make a dumb net more intelligibly searchable..

    Still, we have links, we have meta-information in the page, and we can even use fast algorithms to get a grip on the gist of a specific content, and we even have RSS and XML codes coming that actually work kind of like the former inventor wanted it.. great..

    So we have a functional sphere in which every point is defined by a specific set of connections to its linkual context. Its functional place in the net.

    What do we want however: we want context stuff.
    We can do the very painful job of actually analyzing user behaviour and find out when someone actually visits the page: in what kind of mood, from where is he referred, how long does he stay..the protocols already tell us a lot, don't they..

    It would be smart to write code into the browser that allows us to make the browser do the analytical work. Just like the specimen search project. Because it is a pain in the ass to analyze user behaviour implicitly by looking at the logs.

    So, storing the information of each "point" in the system in a grid and adding content- and context-related information derived by statistical analysis of user behavior (right, we don't need to watch everyone), we would be able to lay several other layers of grids over the actual grid ...

    We would then have this information: the specific location of information, the specificity of the actual content-linkage structure (useless rubbish is not as complexly linked within a, let's say, 5-click horizon as important information..while top secret information links out not at all or only sparsely), the place of the content based on its functional and contextual place in a representative user's everyday work,
    and well...

    Okay, let's get even more into this.. we can now not only use the representative user in order to induce structures into our hyper-search-engine..
    we can even create specific representative users. Bankers love to read articles about rockstars eating rats, and that is probably why they jump from a site with hot porn content on ...
    well, most obviously, using representative schemes of access behaviour and context usage of specific user groups even allows us to eliminate those sites that have no content on them but "money, stocks, moneystocks, stocksmoney, stockmoneys" .. you know these sites, and Google somehow can't handle them...

    And we actually separate the web into "spheres of specific normal usage".

    Of course..all mining and better search rests in the power of mapping information. Throwing information and functional relationships on a Google Earth map can help a lot, and the actual information Google, or we as search engineers, want to know is .. how do people react to advertisements on TV and on the Net: who is watching, where is he watching, etc..

    Knowing this, we have to think about synergies.. we have representative users, their specific spheres in which they move through the internet, we know a) what context they usually search in (at a specific time of the day, from a specific location, from a specific company...etc), we also have clear indicators for abnormal behaviour (the banker surfs through the erotic worlds..) and find ways to re-contextualize it.. (we use, of course, the browser protocol and the user's CPU to calculate his context.):..

    and so on..
    You might wonder..what has this to do with language analysis....

    You figured it.. NOTHING...Because language is not at all important ..and it is even less important to have a computer understand a question..
    If I type 20 words into an engine and the engine knows exactly what I want, because that very combination of words clearly identifies the context, the function of the search, the quality of content wanted (popular, alternative, "underground", "sensitive") etc..then we have a good search engine.

    Fact is, the user of a computer does not think in grammar....and grammar is just some useless thing to talk about.