A Smarter Web

New technologies will make online search more intelligent–and may even lead to a “Web 3.0.”

John Borland

March 1, 2007

This article appears in the March/April 2007 issue of Technology Review.

Last year, Eric Miller, an MIT-affiliated computer scientist, stood on a beach in southern France, watching the sun set, studying a document he’d printed earlier that afternoon. A March rain had begun to fall, and the ink was beginning to smear.

Five years before, he’d agreed to lead a diverse group of researchers working on a project called the Semantic Web, which seeks to give computers the ability–the seeming intelligence–to understand content on the World Wide Web. At the time, he’d made a list of goals, a copy of which he now held in his hand. If he’d achieved those goals, his part of the job was done.

Taking stock on the beach, he crossed off items one by one. The Semantic Web initiative’s basic standards were in place; big companies were involved; startups were merging or being purchased; analysts and national and international newspapers, not just technical publications, were writing about the project. Only a single item remained: taking the technology mainstream. Maybe it was time to make this happen himself, he thought. Time to move into the business world at last.

“For the Semantic Web, it was no longer a matter of if but of when,” Miller says. “I felt I could be more useful by helping people get on with it.”

Now, six months after the launch of his own Zepheira, a consulting company that helps businesses link fragmented data sources into easily searched wholes, Miller’s beachside decision seems increasingly prescient. The Semantic Web community’s grandest visions, of data-surfing computer servants that automatically reason their way through problems, have yet to be fulfilled. But the basic technologies that Miller shepherded through research labs and standards committees are joining the everyday Web. They can be found everywhere–on entertainment and travel sites, in business and scientific databases–and are forming the core of what some promoters call a nascent “Web 3.0.”

Already, these techniques are helping developers stitch together complex applications or bring once-inaccessible data sources online. Semantic Web tools now in use improve and automate database searches, helping people choose vacation destinations or sort through complicated financial data more efficiently. It may be years before the Web is populated by truly intelligent software agents automatically doing our bidding, but their precursors are helping people find better answers to questions today.

The “3.0” claim is ambitious, casting these new tools as successors to several earlier–but still viable–generations of Net technology. Web 1.0 refers to the first generation of the commercial Internet, dominated by content that was only marginally interactive. Web 2.0, characterized by features such as tagging, social networks, and user-created taxonomies of content called “folksonomies,” added a new layer of interactivity, represented by sites such as Flickr, Del.icio.us, and Wikipedia.

Analysts, researchers, and pundits have subsequently argued over what, if anything, would deserve to be called “3.0.” Definitions have ranged from widespread mobile broadband access to a Web full of on-demand software services. A much-read article in the New York Times last November clarified the debate, however. In it, John Markoff defined Web 3.0 as a set of technologies that offer efficient new ways to help computers organize and draw conclusions from online data, and that definition has since dominated discussions at conferences, on blogs, and among entrepreneurs.

The 3.0 moniker has its critics. Miller himself, like many in his research community, frowns at the idea of applying old-fashioned software release numbers to a Web that evolves continually and on many fronts. Yet even skeptics acknowledge the advent of something qualitatively different. Early versions of technologies that meet Markoff’s definition are being built into the new online TV service Joost. They’ve been used to organize Yahoo’s food section and make it more searchable. They’re part of Oracle’s latest, most powerful database suite, and Hewlett-Packard has produced open-source tools for creating Semantic Web applications. Massive scientific databases, such as the Creative Commons-affiliated Neurocommons, are being constructed around the new ideas, while entrepreneurs are readying a variety of tools for release this year.

The next wave of technologies might ultimately blend pared-down Semantic Web tools with Web 2.0’s capacity for dynamic user-generated connections. It may include a dash of data mining, with computers automatically extracting patterns from the Net’s hubbub of conversation. The technology will probably take years to fulfill its promise, but it will almost certainly make the Web easier to use.

“There is a clear understanding that there have to be better ways to connect the mass of data online and interrogate it,” says Daniel Waterhouse, a partner at the venture capital firm 3i. Waterhouse calls himself skeptical of the “Web 3.0” hyperbole but has invested in at least one Semantic Web-based business, the U.K. company Garlik. “We’re just at the start,” he says. “What we can do with search today is very primitive.”

Melvil Dewey and the Vision of a New Web

For more than a decade, Miller has been at the center of this slow-cresting technological wave. Other names have been more prominent–Web creator Tim Berners-Lee is the Semantic Web’s most visible proselytizer, for example. But Miller’s own experiences trace the technology’s history, from academic halls through standards bodies and, finally, into the private sector.

In the often scruffy Web world, the 39-year-old Miller has been a clean-cut exception, an articulate and persuasive technological evangelist who looks less programmer than confident young diplomat. He’s spent most of his professional life in Dublin, OH, far from Silicon Valley and from MIT, where he continues to serve as a research scientist. But it’s no accident that Zepheira is based in this Columbus suburb, or that Miller himself has stayed put. Dublin is a hub of digital library science, and as the Semantic Web project has attempted to give order to the vast amounts of information online, it has naturally tapped the expertise of library researchers here.

Miller joined this community as a computer engineering student at Ohio State University, near the headquarters of a group called the Online Computer Library Center (OCLC). His initial attraction was simple: OCLC had the largest collection of computers in the vicinity of Ohio State. But it also oversees the venerable Dewey Decimal System, and its members are the modern-day inheritors of Melvil Dewey’s obsession with organizing and accessing information.

Dewey was no technologist, but the libraries of his time were as poorly organized as today’s Web. Books were often placed in simple alphabetical order, or even lined up by size. Libraries commonly numbered shelves and assigned books to them heedless of subject matter. As a 21-year-old librarian’s assistant, Dewey found this system appalling: order, he believed, made for smoother access to information.

Dewey envisioned all human knowledge as falling along a spectrum whose order could be represented numerically. Even if arbitrary, his system gave context to library searches; when seeking a book on Greek history, for example, a researcher could be assured that other relevant texts would be nearby. A book’s location on the shelves, relative to nearby books, itself aided scholars in their search for information.

As the Web gained ground in the early 1990s, it naturally drew the attention of Miller and the other latter-day Deweys at OCLC. Young as it was, the Web was already outgrowing attempts to categorize its contents. Portals like Yahoo forsook topic directories in favor of increasingly powerful search tools, but even these routinely produced irrelevant results.

Nor was it just librarians who worried about this disorder. Companies like Netscape and Microsoft wanted to lead their customers to websites more efficiently. Berners-Lee himself, in his original Web outlines, had described a way to add contextual information to hyperlinks, to offer computers clues about what would be on the other end.

This idea had been dropped in favor of the simple, one-size-fits-all hyperlink. But Berners-Lee didn’t give it up altogether, and the idea of connecting data with links that meant something retained its appeal.

On the Road to Semantics

By the mid-1990s, the computing community as a whole was falling in love with the idea of metadata, a way of providing Web pages with computer-readable instructions or labels that would be invisible to human readers.

To use an old metaphor, imagine the Web as a highway system, with hyperlinks as connecting roads. The early Web offered road signs readable by humans but meaningless to computers. A human might understand that “FatFelines.com” referred to cats, or that a link led to a veterinarian’s office, but computers, search engines, and software could not.

Metadata promised to add the missing signage. XML–the code underlying today’s complicated websites, which describes how to find and display content–emerged as one powerful variety. But even XML can’t serve as an ordering principle for the entire Web; it was designed to let Web developers label data with their own custom “tags”–as if different cities posted signs in related but mutually incomprehensible dialects.

In early 1996, researchers at the MIT-based World Wide Web Consortium (W3C) asked Miller, then an Ohio State graduate student and OCLC researcher, for his opinion on a different type of metadata proposal. The U.S. Congress was looking for ways to keep children from being exposed to sexually explicit material online, and Web researchers had responded with a system of computer-readable labels identifying such content. The labels could be applied either by Web publishers or by ratings boards. Software could then use these labels to filter out objectionable content, if desired.

Miller, among others, saw larger possibilities. Why, he asked, limit the descriptive information associated with Web pages to their suitability for minors? If Web content was going to be labeled, why not use the same infrastructure to classify other information, like the price, subject, or title of a book for sale online? That kind of general-purpose metadata–which, unlike XML, would be consistent across sites–would be a boon to people, or computers, looking for things on the Web.

This idea resonated with other Web researchers, and in the late 1990s it began to bear fruit. Its first major result was the Resource Description Framework (RDF), a new system for locating and describing information whose specifications were published as a complete W3C recommendation in 1999. But over time, proponents of the idea became more ambitious and began looking to the artificial-intelligence community for ways to help computers independently understand and navigate through this web of metadata.

Since 1998, researchers at W3C, led by Berners-Lee, had been discussing the idea of a “semantic” Web, which not only would provide a way to classify individual bits of online data such as pictures, text, or database entries but would define relationships between classification categories as well. Dictionaries and thesauruses called “ontologies” would translate between different ways of describing the same types of data, such as “post code” and “zip code.” All this would help computers start to interpret Web content more efficiently.
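The translating role of an ontology can be pictured with a toy example. What follows is only a minimal sketch in Python, not the W3C's actual ontology languages: the mapping table, the shared term "postal_code," and the normalize function are all hypothetical names invented for illustration.

```python
# Hypothetical mapping from site-specific field names to one shared term.
# A real ontology would express this in a standard vocabulary; this
# dictionary is only an illustration of the idea.
EQUIVALENT_TERMS = {
    "post code": "postal_code",
    "zip code": "postal_code",
}


def normalize(record):
    """Return a copy of the record with known field names mapped to shared terms.

    Field names that the mapping does not know are kept, lowercased.
    """
    normalized = {}
    for field, value in record.items():
        key = field.lower()
        normalized[EQUIVALENT_TERMS.get(key, key)] = value
    return normalized


# Two records that describe the same kind of data in different words.
uk_record = {"Post Code": "SW1A 1AA"}
us_record = {"Zip Code": "02139"}

# After normalization, software can see that both carry the same field.
uk_fields = set(normalize(uk_record))
us_fields = set(normalize(us_record))
print(uk_fields == us_fields)
```

Once both records use the same term, a single query for postal codes can reach data from either source, which is the kind of translation the ontologies were meant to provide.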

In this vision, the Web would take on aspects of a database, or a web of databases. Databases are good at providing simple answers to queries because their software understands the context of each entry. “One Main Street” is understood as an address, not just random text. Defining the context of online data just as clearly–labeling a cat as an animal, and a veterinarian as an animal doctor, for example–could result in a Web that computers could browse and understand much as humans do, researchers hoped.
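A rough sketch can make the idea of labeled context concrete. The Python below stores facts as subject-relationship-object statements, loosely in the spirit of RDF but not in its real syntax. The example facts ("Felix," "Dr. Smith") and the relationship names is_a, subclass_of, and treats are assumptions made up for this illustration.

```python
# Each fact is a (subject, relationship, object) statement.
# All names here are hypothetical; real RDF identifies things with URIs.
TRIPLES = {
    ("Felix", "is_a", "cat"),
    ("cat", "subclass_of", "animal"),
    ("Dr. Smith", "is_a", "veterinarian"),
    ("veterinarian", "treats", "animal"),
    ("One Main Street", "is_a", "address"),
}


def classes_of(thing, triples):
    """Collect every class a thing belongs to, following subclass_of links."""
    found = set()
    frontier = [o for s, p, o in triples if s == thing and p == "is_a"]
    while frontier:
        cls = frontier.pop()
        if cls in found:
            continue
        found.add(cls)
        # Climb from this class to any broader class it is declared under.
        frontier.extend(o for s, p, o in triples if s == cls and p == "subclass_of")
    return found


def can_treat(doctor, patient, triples):
    """True if some class of the doctor is stated to treat some class of the patient."""
    doctor_classes = classes_of(doctor, triples)
    patient_classes = classes_of(patient, triples)
    return any(
        p == "treats" and s in doctor_classes and o in patient_classes
        for s, p, o in triples
    )


print(can_treat("Dr. Smith", "Felix", TRIPLES))
print(can_treat("Dr. Smith", "One Main Street", TRIPLES))
```

Nothing in the data says outright that Dr. Smith can treat Felix. The program reaches that conclusion by chaining the labels: Felix is a cat, a cat is an animal, and a veterinarian treats animals. An address carries no such chain, so the second question comes back negative.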

To go back to the Web-as-highway metaphor, this might be analogous to creating detailed road signs that cars themselves could understand and upon which they could act. The signs might point out routes, describe road and traffic conditions, and offer detailed information about destinations. A car able to understand the signs could navigate efficiently to its destination, with minimal intervention by the driver.

In articles and talks, Berners-Lee and others began describing a future in which software agents would similarly skip across this “web of data,” understand Web pages’ metadata content, and complete tasks that take humans hours today. Say you’d had some lingering back pain: a program might determine a specialist’s availability, check an insurance site’s database for in-plan status, consult your calendar, and schedule an appointment. Another program might look up restaurant reviews, check a map database, cross-reference open table times with your calendar, and make a dinner reservation.

At the beginning of 2001, the effort to realize this vision became official. The W3C tapped Miller to head up a new Semantic Web initiative, unveiled at a conference early that year in Hong Kong. Miller couldn’t be there in person; his wife was in labor with their first child, back in Dublin. Miller saw it as a double birthday.

Standards and Critics

The next years weren’t easy. Miller quickly had to become researcher, diplomat, and evangelist. The effort to build the Semantic Web has been well publicized, and Berners-Lee’s name in particular has lent its success an air of near-inevitability. But its visibility has also made it the target of frequent, and often harsh, criticism.

Some argue that it’s unrealistic to expect busy people and businesses to create enough metadata to make the Semantic Web work. The simple tagging used in Web 2.0 applications lets users spontaneously invent their own descriptions, which may or may not relate to anything else. Semantic Web systems require a more complicated infrastructure, in which developers order terms according to their conceptual relationships to one another and–like Dewey with his books–fit data into the resulting schema. Creating and maintaining these schemas, or even adapting preëxisting ones, is no trivial task. Coding a database or website with metadata in the language of a schema can itself be painstaking work. But the solution to this problem may simply be better tools for creating metadata, like the blog and social-networking sites that have made building personal websites easy. “A lot of Semantic Web researchers have realized this disconnect and are investing in more human interfaces,” says David Huynh, an MIT student who has helped create several such tools.

Other critics have questioned whether the ontologies designed to translate between different data descriptions can realistically help computers understand the intricacies of even basic human concepts. Equating “post code” and “zip code” is easy enough, the critics say. But what happens when a computer stumbles on a word like “marriage,” with its competing connotations of monogamy, polygamy, same-sex relationships, and civil unions? A system of interlocking computer definitions could not reliably capture the conflicting meanings of many such common words, the argument goes.

“People forget there are humans under the hood and try to treat the Web like a database instead of a social construct,” says Clay Shirky, an Internet consultant and adjunct professor of interactive telecommunications at New York University.

It hasn’t helped that until very recently, much of the work on the Semantic Web has been hidden inside big companies or research institutions, with few applications emerging. But that paucity of products has masked a growing amount of experimentation. Miller’s W3C working group, which included researchers and technologists from across academia and industry, was responsible for setting the core standards, a process completed in early 2004. Like HP, other companies have also created software development tools based on these standards, while a growing number of independent researchers have applied them to complicated data sets.

Life scientists with vast stores of biological data have been especially interested. In a recent trial project at Massachusetts General Hospital and Harvard University, conducted in collaboration with Miller when he was still at the W3C, clinical data was encoded using Semantic Web techniques so that researchers could share it and search it more easily. The Neurocommons project is taking the same approach with genetic and biotech research papers. Funded by the scientific-data management company Teranode, the Neurocommons is again working closely with W3C, as well as with MIT’s Computer Science and Artificial Intelligence Laboratory.

Government agencies have conducted similar trials, with the U.S. Defense Advanced Research Projects Agency (DARPA) investing heavily in its own research and prototype projects based on the Semantic Web standards. The agency’s former Information Exploitation Office program manager Mark Greaves, who oversaw much of its Semantic Web work, remains an enthusiastic backer.

“What we’re trying to do with the Semantic Web is build a digital Aristotle,” says Greaves, now senior research program manager at Paul Allen’s investment company, Vulcan, which is sponsoring a large-scale artificial-intelligence venture called Project Halo that will use Semantic Web data-representation techniques. “We want to take the Web and make it more like a database, make it a system that can answer questions, not just get a pile of documents that might hold an answer.”

Into the Real World

If Miller’s sunset epiphany showed him the path forward, the community he represented was following similar routes. All around him, ideas that germinated for years in labs and research papers are beginning to take root in the marketplace.

But they’re also being savagely pruned. Businesses, even Miller’s Zepheira, are adopting the simplest Semantic Web tools while putting aside the more ambitious ones. Entrepreneurs are blending Web 2.0 features with Semantic Web data-handling techniques. Indeed, if there is to be a Web 3.0, it is likely to include only a portion of the Semantic Web community’s work, along with a healthy smattering of other technologies. “The thing being called Web 3.0 is an important subset of the Semantic Web vision,” says Jim Hendler, professor of computer science at Rensselaer Polytechnic Institute, who was one of the initiative’s pioneer theorists. “It’s really a realization that a little bit of Semantic Web stuff with what’s called Web 2.0 is a tremendously powerful technology.”

Much of that technology is still invisible to consumers, as big companies internally apply the Semantic Web’s efficient ways of organizing data. Miller’s Zepheira, at least today, is focused on helping them with that job. Zepheira’s pitch to companies is fairly simple, perhaps looking once again to Dewey’s disorganized libraries. Businesses are awash in inaccessible data on intranets, in unconnected databases, even on employees’ hard drives. For each of its clients, Zepheira aims to bring all that data into the light, code it using Semantic Web techniques, and connect it so that it becomes useful across the organization. In one case, that might mean linking Excel documents to payroll or customer databases; in another, connecting customer accounts to personalized information feeds. These disparate data sources would be tied together with RDF and other Semantic Web mechanisms that help computerized search tools find and filter information more efficiently.

One of the company’s early clients is Citigroup. The banking giant’s global head of capital markets and banking technology, Chris Augustin, is heading an initiative to use semantic technologies to organize and correlate information from diverse financial-data feeds. The goal is to help identify capital-market investment opportunities. “We are interested in providing our customers and traders with the latest information in the most relevant and timely manner to help them make the best decisions quickly,” says Rachel Yager, the program director overseeing the effort.

Others are beginning to apply semantic techniques to consumer-focused businesses, varying widely in how deeply they draw from the Semantic Web’s well.

The Los Altos, CA-based website RealTravel, created by chief executive Ken Leeder, AdForce founder Michael Tanne, and Semantic Web researcher Tom Gruber, offers an early example of what it will look like to mix Web 2.0 features like tagging and blogging with a semantic data-organization system. The U.K.-based Garlik, headed by former top executives of the British online bank Egg, uses an RDF-based database as part of a privacy service that keeps customers apprised of how much of their personal information is appearing online. “We think Garlik’s technology gives them a really interesting technology advantage, but this is at a very early stage,” says 3i’s Waterhouse, whose venture firm helped fund Garlik. “Semantic technology is going to be a slow burn.”

San Francisco-based Radar Networks, created by EarthWeb cofounder Nova Spivack and funded in part by Allen’s Vulcan Capital, plans eventually to release a full development platform for commercial Semantic Web applications, and will begin to release collaboration and information-sharing tools based on the techniques this year. Spivack himself has been part of the Semantic Web community for years, most recently working with DARPA and SRI International on a long-term project called CALO (Cognitive Agent that Learns and Organizes), which aims to help military analysts filter and analyze new data.

Radar Networks’ tools will be based on familiar ideas such as sharing bookmarks, notes, and documents, but Spivack says that ordering and linking this data within the basic Semantic Web framework will help teams analyze their work more efficiently. He predicts that the mainstream Web will spend years assimilating these basic organization processes, using RDF and related tools, while the Semantic Web’s more ambitious artificial-intelligence applications wait in the wings.

“First comes what I call the World Wide Database, making data accessible through queries, with no AI involved,” Spivack says. “Step two is the intelligent Web, enabling software to process information more intelligently. That’s what we’re working on.”

One of the highest-profile deployments of Semantic Web technology is courtesy of Joost, the closely watched Internet television startup formed by the creators of Skype and Kazaa. The company has moved extraordinarily quickly from last year’s original conception, through software development and Byzantine negotiations with video content owners, into beta-testing of its customizable peer-to-peer TV software.

That would have been impossible if not for the Semantic Web’s RDF techniques, which Joost chief technology officer Dirk-Willem van Gulik calls “XML on steroids.” RDF allowed developers to write software without worrying about widely varying content-use restrictions or national regulations, all of which could be accommodated afterwards using RDF’s Semantic Web linkages.

Joost’s RDF infrastructure also means that users will have wide-ranging control over the service, van Gulik adds. People will be able to program their own virtual TV networks–if an advertiser wants its own “channel,” say, or an environmental group wants to bring topical content to its members–by using the powerful search and filtering capacity inherent in the semantic ordering of data.

But van Gulik’s admiration goes only so far. While he believes that the simpler elements of the Semantic Web will be essential to a huge range of online businesses, the rest he can do without. “RDF [and the other rudimentary semantic technologies] solve meaningful problems, and it costs less than any other approach would,” he says. “The entire remainder”–the more ambitious work with ontologies and artificial intelligence–“is completely academic.”

A Hybrid 3.0

Even as Semantic Web tools begin to reach the market, so do similar techniques developed outside Miller’s community. There are many ways, the market seems to be saying, to make the Web give ever better answers.

Semantic Web technologies add order to data from the outset, putting up the road signs that let computers understand what they’re reading. But many researchers note that much of the Web lacks such signs and probably always will. Computer scientists call this data “unstructured.”

Much research has focused on helping computers extract answers from this unstructured data, and the results may ultimately complement Semantic Web techniques. Data-mining companies have long worked with intelligence agencies to find patterns in chaotic streams of information and are now turning to commercial applications. IBM already offers a service that combs blogs, message boards, and newsgroups for discussions of clients’ products and draws conclusions about trends, without the help of metadata’s signposts.

“We don’t expect everyone to go through the massive effort of using Semantic Web tools,” says Maria Azua, vice president of technology and innovation at IBM. “If you have time and effort to do it, do it. But we can’t wait for everyone to do it, or we’ll never have this additional information.”

An intriguing, if stealthy, company called Metaweb Technologies, spun out of Applied Minds by parallel-computing pioneer Danny Hillis, is promising to “extract ordered knowledge out of the information chaos that is the current Internet,” according to its website. Hillis has previously written about a “Knowledge Web” with data-organization characteristics similar to those that Berners-Lee champions, but he has not yet said whether Metaweb will be based on Semantic Web standards. The company has been funded by Benchmark Capital, Millennium Technology Ventures, and eBay founder Pierre Omidyar, among others.

“We’ve built up a set of powerful tools and utilities and initiatives in the Web-based community, and to leverage and harness them, an infrastructure is desperately needed,” says Millennium managing partner Dan Burstein. “The Web needs extreme computer science to support these applications.”

Alternatively, the socially networked, tag-rich services of Flickr, Last.fm, Del.icio.us, and the like are already imposing a grassroots order on collections of photos, music databases, and Web pages. Allowing Web users to draw their own connections, creating, sharing, and modifying their own systems of organization, provides data with structure that is usefully modeled on the way people think, advocates say.

“The world is not like a set of shelves, nor is it like a database,” says NYU’s Shirky. “We see this over and over with tags, where we have an actual picture of the human brain classifying information.”

No one knows what organizational technique will ultimately prevail. But what’s increasingly clear is that different kinds of order, and a variety of ways to unearth data and reuse it in new applications, are coming to the Web. There will be no Dewey here, no one system that arranges all the world’s digital data in a single framework.

Even in his role as digital librarian, as custodian of the Semantic Web’s development, Miller thinks this variety is good. It’s been one of the goals from the beginning, he says. If there is indeed a Web 3.0, or even just a 2.1, it will be a hybrid, spun from a number of technological threads, all helping to make data more accessible and more useful.

“It’s exciting to see Web 2.0 and social software come on line, but I find it even more exciting when that data can be shared,” Miller says. “This notion of trying to recombine the data together, and driving new kinds of data, is really at the heart of what we’ve been focusing on.”

John Borland is the coauthor of Dungeons and Dreamers: The Rise of Computer Game Culture from Geek to Chic. He lives in Berlin.
