As a global resource built from the spare time of millions of volunteers, Wikipedia may be the epitome of Web 2.0. But the Wikimedia Foundation, a nonprofit organization that runs Wikipedia, among other projects, is now thinking about how to make it a linchpin of Web 3.0, or the semantic Web.
That means making some of the data in Wikipedia’s 15 million (and counting) articles understandable to computers as well as humans. This would allow software to know, for example, that the numbers shown in one of the columns of a table listing U.S. presidents are dates. That could, in turn, allow applications that draw on Wikipedia to automatically generate historical timelines or answer the kind of general knowledge questions that would usually entail a person finding and reading a relevant entry on the site.
At the 2010 Semantic Technology conference in San Francisco last month, the foundation’s deputy director, Erik Möller, and colleague Trevor Parscal, a user-experience developer for Wikimedia, showed some first steps taken by the foundation to explore how more semantic structure might be added to Wikipedia. They also appealed to the semantic Web community to help develop ways to make Wikipedia’s knowledge more accessible to computers and software.
“Semantic information already exists in Wikipedia, and people are already building on it,” says Möller. “Unfortunately, we’re not really helping, and they have to use extensive processing to do so.”
One example is DBpedia, a semantic database built using software that collects data from the site’s pages, and maintained by the Free University of Berlin and the University of Leipzig, both in Germany. Another is Freebase, a for-profit knowledge database, much of which was also sourced by scraping Wikipedia. Freebase is the data source used by question-answering search engine Powerset, which was acquired by Microsoft to be part of its Bing search engine.
The first targets for Möller and Parscal are the “infoboxes” that appear as summaries on many Wikipedia pages, and the tables in entries, such as the one showing the gross national product of all the countries in the world.
“Just being able to reuse that data within Wikipedia would be a big thing,” says Yaron Koren, who runs a consultancy that specializes in Semantic MediaWiki, an extension to the MediaWiki software used to build Wikipedia. “The manual work that goes into maintaining the many tables and lists today could be eliminated,” he adds. Instead, lists could be automatically generated from the infoboxes of other pages. It would also be possible to generate maps, using the location coordinates that feature on some pages, or automatically generate timelines to summarize periods in history covered by many other pages, says Möller.
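To make the idea of generated lists concrete, here is a minimal Python sketch of a list built from infobox data instead of being maintained by hand. It is an illustration only, not how Semantic MediaWiki works: the records, the field names (`page`, `type`, `population`), and the helper functions are all hypothetical, and the figures are made up.

```python
from operator import itemgetter

# Hypothetical infobox records, already extracted as structured data.
# Page names and population figures are invented for illustration.
INFOBOXES = [
    {"page": "Country A", "type": "country", "population": 5_000_000},
    {"page": "Country B", "type": "country", "population": 12_000_000},
    {"page": "City C", "type": "city", "population": 800_000},
]


def build_list(infoboxes, kind, sort_key):
    """Select infoboxes of one kind and sort them, largest value first."""
    rows = [box for box in infoboxes if box["type"] == kind]
    return sorted(rows, key=itemgetter(sort_key), reverse=True)


def render_wikitext_list(rows, field):
    """Render the rows as a numbered wikitext list with links to each page."""
    return "\n".join(f"# [[{row['page']}]]: {row[field]:,}" for row in rows)


countries = build_list(INFOBOXES, "country", "population")
print(render_wikitext_list(countries, "population"))
```

Because the list is derived from the records, correcting a figure in one infobox would change every list that uses it the next time the list is rendered.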
Möller says an example of the kind of services that could be enabled is WikiPics, developed by Daniel Kinzler at the German Wikimedia foundation. Kinzler scraped a database of all the links that connect different Wikipedia pages available in multiple languages and built a fully multilingual image search. When a user puts in the term “horse,” for example, the service knows to also find images of “cheval” (French) and “Pferd” (German). “You’re searching concepts instead of terms,” says Möller. However, for now the site relies on the slow process of scraping the whole of Wikipedia to update its knowledge. A semantic Wikipedia would maintain a live database that could be queried at any time.
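The concept-based search Möller describes can be sketched in a few lines of Python. This is not WikiPics's actual code: the link table, the image index, the file names, and the function names below are assumptions made for illustration. The sketch only shows the general technique of expanding a search term into its equivalents in other languages before looking up images.

```python
# Hypothetical table of interlanguage links: each entry groups the article
# titles that describe one concept in different language editions.
LANGUAGE_LINKS = [
    {"en": "Horse", "fr": "Cheval", "de": "Pferd"},
    {"en": "Dog", "fr": "Chien", "de": "Hund"},
]

# Hypothetical image index keyed by lowercase title; file names are invented.
IMAGE_INDEX = {
    "horse": ["horse_en.jpg"],
    "cheval": ["cheval_fr.jpg"],
    "pferd": ["pferd_de.jpg"],
    "dog": ["dog_en.jpg"],
}


def expand_term(term, concepts):
    """Return every known title for the concept that matches the term."""
    needle = term.casefold()
    for concept in concepts:
        if needle in (title.casefold() for title in concept.values()):
            return sorted(concept.values())
    # Unknown concept: fall back to the term as typed.
    return [term]


def search_images(term, concepts, image_index):
    """Collect images filed under any language's title for the concept."""
    results = []
    for title in expand_term(term, concepts):
        results.extend(image_index.get(title.casefold(), []))
    return results


print(search_images("horse", LANGUAGE_LINKS, IMAGE_INDEX))
```

In this sketch the link table is a fixed snapshot, which mirrors the limitation described above: it is only as current as the last scrape.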
Wikipedia faces two big challenges in embracing semantic concepts, says Möller. One is that no one has yet built a semantic web service on the scale of a site such as Wikipedia, and it is unclear whether existing software like Semantic MediaWiki is up to the task, he says.
A second challenge is the feature of Wikipedia most responsible for its success so far: its community. “Thinking about adding semantic structure is a natural extension of what Wikipedia needs to do, given prevailing trends,” says Andrew Lih of the University of Southern California, author of the 2009 book The Wikipedia Revolution. “But I do worry a bit about the database aspect that comes with this–the attraction of wikis in the first place is in the way they have been hand-edited by humans.”
Parscal has been leading efforts to make it easy for anyone to add or edit the data of a large semantic store. “We’ve been working on a visual editor that suggests how we might help users contribute structured data, and that also makes the editing process easier,” says Parscal.
Editing Wikipedia today is already a daunting process that needs improvement, admits Parscal. “If you’ve interacted with our interface,” he explains, “you’ve been slapped in the face by wikitext” (a markup language that uses special code around text to format things like links, references, and section headings). The wikitext for tables or infoboxes–the information most ripe for making semantic–is particularly dense and hard to understand, says Parscal. “We recently did some user experience studies with people that hadn’t used it before; they were quickly quite frustrated.”
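For readers who have not seen it, the sample below shows roughly what infobox wikitext looks like, along with a minimal Python sketch of the kind of processing outside projects must do to pull data out of it. The template, its field names, and its values are invented, and the parser is an assumption for illustration: it handles only a flat infobox with one field per line, which real Wikipedia templates often are not.

```python
import re

# Invented sample in the style of infobox wikitext; not taken from a real page.
SAMPLE_WIKITEXT = """{{Infobox country
| name = Exampleland
| capital = [[Example City]]
| population = 1,234,567
}}"""

# Matches lines of the form "| key = value".
FIELD_PATTERN = re.compile(r"\s*\|\s*([^=|]+?)\s*=\s*(.*?)\s*$")
# Matches [[Target]] and [[Target|Label]], capturing the visible text.
LINK_PATTERN = re.compile(r"\[\[(?:[^|\]]*\|)?([^\]]+)\]\]")


def parse_infobox(wikitext):
    """Extract key/value pairs from a flat, one-field-per-line infobox."""
    fields = {}
    for line in wikitext.splitlines():
        match = FIELD_PATTERN.match(line)
        if match:
            key, value = match.groups()
            # Reduce wiki links to their visible text.
            fields[key] = LINK_PATTERN.sub(r"\1", value)
    return fields


print(parse_infobox(SAMPLE_WIKITEXT))
```

Note that every value comes out as plain text. Nothing in the markup says that the population is a number, which is the gap that semantic structure is meant to close.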
In future, it may be possible to remove the need for a human to populate some parts of Wikipedia altogether, says Möller. “Fundamentally a lot of this data probably shouldn’t be entered by humans in the first place, it should just, say, poll the source of a figure like GDP once a year.” That’s a capability that Koren has already added to Semantic MediaWiki, through an extension called ExternalData.
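The once-a-year polling Möller suggests can be sketched as follows. This is not the ExternalData extension's interface: the record layout, the `fetch` callback, and the function names are hypothetical, and the fetch is a stand-in that makes no network request.

```python
from datetime import date, timedelta


def needs_refresh(last_updated, today, max_age_days=365):
    """Return True once a figure is at least max_age_days old."""
    return (today - last_updated) >= timedelta(days=max_age_days)


def refresh_figure(record, fetch, today):
    """Return an updated copy of the record if it is stale, else the record."""
    if needs_refresh(record["updated"], today):
        return {**record, "value": fetch(record["source"]), "updated": today}
    return record


# Hypothetical stored figure; the value and source name are invented.
record = {
    "page": "Exampleland",
    "field": "gdp",
    "value": 100,
    "source": "example-statistics-office",
    "updated": date(2009, 6, 1),
}


def fake_fetch(source):
    """Stand-in for a real request to the figure's source."""
    return 110


print(refresh_figure(record, fake_fetch, date(2010, 7, 1)))
```

A real deployment would also need to decide what to do when the source is unreachable or returns a figure that looks wrong, which is where human editors would still come in.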