MIT Technology Review

 


The U.S. Defense Advanced Research Projects Agency (DARPA) is one of the major funders of statistical machine translation. Last August, DARPA sponsored machine translation tests for Chinese and Arabic documents; a research group from Google scored the highest, edging out USC’s Information Sciences Institute and IBM’s machine translation arm. Google, which also uses the statistical approach, might have had an edge, Knight notes, because it could use a huge number of computers for the word-crunching and could draw on the entire Internet for its database of pretranslated documents.

In 2005, DARPA also announced the Global Autonomous Language Exploitation (GALE) program, intended to speed up the computer processing of huge numbers of translated documents acquired by its parent program, the Philadelphia-based Linguistic Data Consortium.** GALE is currently in its first year and will transcribe speech from broadcast news sources and talk shows in Arabic, Chinese, and English, and also catalog text from newswire feeds, Web news discussion groups, and blogs in those languages. For now, the project is focused mainly on collecting data from these genres, with researchers in the computer and engineering science department at the University of Pennsylvania doing much of the work.

But even with a large collection of translated material, there will still be language issues to sort out. The next step in machine translation research, beyond matching words and phrases, Knight says, is to smooth out the grammatical inconsistencies that arise when words and phrases are strung together. This smoothing can be accomplished by indexing millions of sentences whose structures were diagrammed at the University of Pennsylvania in the 1990s (the data came from 50,000 sentences in the Wall Street Journal). Just as a database full of words and phrases allows translation software to choose the most statistically probable combination of words, the specific examples of grammar in the diagrammed sentences help the software assign likelihoods to different word orders, says MIT’s Collins.

This is an advance over the traditional method, in which grammar rules were hard-coded into an algorithm, he says. Instead of obeying fixed grammar conventions, software that draws on the diagrammed-sentence database can assign “probabilities and weight on those rules,” Collins says. “[The software] is more likely to learn the context.”
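To make the idea of putting probabilities on grammar rules concrete, here is a minimal sketch in Python. It is not the researchers' software: the three hand-diagrammed sentences, the tag names, and the function names are all hypothetical, and the estimate is a plain relative frequency with no smoothing. It only illustrates how counting structures in diagrammed sentences can yield a likelihood for a given word order.

```python
from collections import Counter, defaultdict

# Hypothetical toy "treebank": each tree is (label, child, child, ...),
# and the words themselves are plain strings at the leaves.
TOY_TREEBANK = [
    ("S",
     ("NP", ("DT", "the"), ("NN", "market")),
     ("VP", ("VBD", "fell"))),
    ("S",
     ("NP", ("DT", "the"), ("JJ", "big"), ("NN", "bank")),
     ("VP", ("VBD", "closed"))),
    ("S",
     ("NP", ("DT", "a"), ("JJ", "new"), ("NN", "rule")),
     ("VP", ("VBD", "helped"),
      ("NP", ("DT", "the"), ("NN", "market")))),
]


def count_rules(tree, counts):
    """Count how often each parent label expands into each child sequence."""
    label, *children = tree
    if all(isinstance(child, str) for child in children):
        return  # a tag sitting directly on a word; word-level rules are skipped here
    rhs = tuple(child[0] for child in children)
    counts[label][rhs] += 1
    for child in children:
        count_rules(child, counts)


def rule_probabilities(treebank):
    """Relative-frequency estimate: count(parent -> children) / count(parent)."""
    counts = defaultdict(Counter)
    for tree in treebank:
        count_rules(tree, counts)
    probs = {}
    for lhs, rhs_counts in counts.items():
        total = sum(rhs_counts.values())
        probs[lhs] = {rhs: n / total for rhs, n in rhs_counts.items()}
    return probs


if __name__ == "__main__":
    probs = rule_probabilities(TOY_TREEBANK)
    # Compare two candidate word orders inside a noun phrase.
    adjective_first = probs["NP"].get(("DT", "JJ", "NN"), 0.0)
    adjective_last = probs["NP"].get(("DT", "NN", "JJ"), 0.0)
    print("determiner adjective noun:", adjective_first)
    print("determiner noun adjective:", adjective_last)
```

In this toy data, a noun phrase with the adjective before the noun gets a probability of 0.5, while the adjective-after-noun order never appears and gets zero. A real system would smooth such estimates so that unseen structures are treated as unlikely rather than impossible.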

In some ways, however, the statistical approach will only be as good as the common instant-messaging translator. Proper names, for instance, still trip up even the most well-read machine translator, which often just translates them along with the rest of the text. Knight admits that his own system still renders his name in Spanish as “Kevin Caballero.”

** Correction, January 20, 2006: The original version of this story, published January 18, stated that the Linguistic Data Consortium was launched in 2005. In fact, the consortium was launched in 1992, and its Global Autonomous Language Exploitation project was launched in 2005. – Eds.
