Computing

Repetez, en anglais, s'il vous plait

  • Wednesday, January 18, 2006
  • By Kate Greene

The U.S. Defense Advanced Research Projects Agency (DARPA) is one of the major funders of statistical machine translation. Last August, DARPA sponsored machine translation tests for Chinese and Arabic documents; a research group from Google scored the highest, nudging out USC’s Information Sciences Institute and IBM’s machine translation arm. Google, which also uses the statistical approach, might have had an edge, Knight notes, because it could use a huge number of computers for the word-crunching and could draw on the entire Internet for its database of pretranslated documents.

In 2005, DARPA also announced the Global Autonomous Language Exploitation (GALE) program, intended to speed up the computer processing of huge numbers of translated documents acquired by its parent program, the Philadelphia-based Linguistic Data Consortium.** GALE is currently in its first year and will transcribe speech from broadcast news sources and talk shows in Arabic, Chinese, and English, and catalogue text newswire feeds, Web news discussion groups, and blogs in those languages. For now, the project is focused mainly on data collection from these genres, with researchers in the computer and engineering science department at the University of Pennsylvania doing much of the work.

But even with a large collection of translated material, there will still be language issues to sort out. The next step in machine translation research, beyond matching words and phrases, Knight says, is to smooth out the grammatical inconsistencies that arise when words and phrases are strung together. This smoothing can be accomplished by indexing millions of sentences whose structures were diagrammed at the University of Pennsylvania in the 1990s (the data came from 50,000 sentences in the Wall Street Journal). Just as a database full of words and phrases allows translation software to choose the most statistically probable combination of words, these specific examples of grammar from the diagrammed sentences help the software assign the likelihood of a given word order, says MIT’s Collins.
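The idea of choosing the most statistically probable words and word order can be sketched in a few lines of Python. This is only a toy illustration, not the method used by any of the research groups mentioned here: the phrase table, the word-pair counts, and the function names (`pick_words`, `order_score`, `translate`) are all invented for the example, and the score is an unnormalized log-score rather than a true probability.

```python
from itertools import permutations
from math import log

# Hypothetical phrase table: P(English word | Spanish word), invented for illustration.
PHRASE_TABLE = {
    "la": {"the": 0.9, "it": 0.1},
    "manzana": {"apple": 0.95, "block": 0.05},
    "roja": {"red": 0.9, "ruddy": 0.1},
}

# Hypothetical counts of adjacent English word pairs.
# "<s>" and "</s>" mark the start and end of a sentence.
BIGRAM_COUNTS = {
    ("<s>", "the"): 8,
    ("the", "red"): 5,
    ("red", "apple"): 6,
    ("apple", "</s>"): 7,
    ("the", "apple"): 3,
    ("apple", "red"): 1,
    ("red", "</s>"): 1,
}


def pick_words(source_words):
    """Choose the most probable English candidate for each source word."""
    return [max(PHRASE_TABLE[w], key=PHRASE_TABLE[w].get) for w in source_words]


def order_score(words, smoothing=0.1):
    """Return a log-score for one word order; higher means more plausible.

    Unseen word pairs get a small smoothing count instead of zero.
    """
    padded = ["<s>", *words, "</s>"]
    pairs = zip(padded, padded[1:])
    return sum(log(BIGRAM_COUNTS.get(pair, 0) + smoothing) for pair in pairs)


def translate(source_words):
    """Pick likely words, then keep the ordering with the best score."""
    candidates = pick_words(source_words)
    return list(max(permutations(candidates), key=order_score))


print(translate(["la", "manzana", "roja"]))  # ['the', 'red', 'apple']
```

Trying every ordering is workable only for very short sentences; it is used here to keep the sketch small.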

This is an advance over the traditional method, in which grammar rules were hard-coded into an algorithm, he says. Rather than obeying fixed grammar conventions, as traditional machine translation does, software that draws on the diagrammed-sentence database can assign “probabilities and weight on those rules,” says Collins. “[The software] is more likely to learn the context,” he says.
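One common way to put probabilities on grammar rules is to count how often each rule appears in a set of diagrammed sentences and divide by how often its left-hand symbol appears. The sketch below assumes that relative-frequency approach; Collins does not spell out his method here, and the three tiny "parses" and the function name `rule_probabilities` are invented for illustration.

```python
from collections import Counter

# Hypothetical diagrammed sentences, each listed as the (parent, children)
# grammar rules its parse uses. The labels follow common treebank style
# (S = sentence, NP = noun phrase, VP = verb phrase, and so on).
TREEBANK = [
    [("S", ("NP", "VP")), ("NP", ("DT", "NN")),
     ("VP", ("VB", "NP")), ("NP", ("DT", "JJ", "NN"))],
    [("S", ("NP", "VP")), ("NP", ("DT", "NN")), ("VP", ("VB",))],
    [("S", ("NP", "VP")), ("NP", ("NN",)),
     ("VP", ("VB", "NP")), ("NP", ("DT", "NN"))],
]


def rule_probabilities(treebank):
    """Estimate P(rule | parent) as count(rule) / count(parent)."""
    rule_counts = Counter(rule for tree in treebank for rule in tree)
    parent_counts = Counter()
    for (parent, _children), count in rule_counts.items():
        parent_counts[parent] += count
    return {
        rule: count / parent_counts[rule[0]]
        for rule, count in rule_counts.items()
    }


probs = rule_probabilities(TREEBANK)
for (parent, children), p in sorted(probs.items()):
    print(f"{parent} -> {' '.join(children)}: {p:.2f}")
```

With weights like these, the software can prefer the more frequent structure when two analyses compete, instead of treating every rule as equally valid.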

In some ways, however, the statistical approach will only be as good as the common instant-messaging translator. Proper names, for instance, still trip up even the most well-read machine translator, and they often just get translated along with the rest of the text. According to his system, Knight admits, the Spanish version of his surname is still “Kevin Caballero.”

** Correction, January 20, 2006: The original version of this story, published January 18, stated that the Linguistic Data Consortium was launched in 2005. In fact, the consortium was launched in 1992, and its Global Autonomous Language Exploitation project was launched in 2005. -- Eds.

Guest (Pat)

  • 2219 Days Ago
  • 01/18/2006

Wrong Example from the beginning of the Article

Go to the Japanese Apple Farm Web site, and look carefully. You will find out that the example given from the beginning is actually (most likely) a human translation--because it is part of the English version of the site.

Go to the Japanese version and use Google Translate, you will get this:
"The Someya apple garden it will pass very! It is planted in 1954, furthermore even now, .... " Much more unintelligible.

Admittedly a simple mistake. But well, if this esteemed magazine gets such very basics wrong, how can we trust it with more complicated stuff it routinely tries to cover!
(A high-profile example is the obviously highly biased coverage of Aubrey de Grey's work. By the way, when are you going to publish the result of SENS challenge?)

Guest (Mike Maxwell)

  • 2219 Days Ago
  • 01/18/2006

LDC info wrong

The para about the LDC is simply wrong.  The LDC has been around for 13 years; I worked there until September.  (See their website at http://www.ldc.upenn.edu.)

There are other errors, too, some of them trivial ("Kevin Caballero" is not a surname--only "Caballero" is a surname, "millions of sentences" should be "millions of words", etc.).  But all in all, I would say they need better fact checking.

Guest (Benoit Ozell)

  • 2219 Days Ago
  • 01/18/2006

Répétez...

If the aim is to have a good translation, the title should be: "Répétez, en anglais, s'il vous plaît".

Guest (Wade Roush)

  • 2219 Days Ago
  • 01/18/2006

Error corrected

Pat: Thank you for helping us to catch this mistake. We have corrected it, and appended an explanation of the error at the bottom of the article. -- Wade Roush, Executive Web Editor, TechnologyReview.com.

Guest (Mike Maxwell)

  • 2219 Days Ago
  • 01/18/2006

Error correction still unclear

From the Correction:

"...an excerpt from Google's translation of the apple farm's own English version of its website..."

I don't get it.  Google is translating English into English?  Shouldn't that be "...an excerpt from the apple farm's own English version of its website..."?

Guest (Andrew Mole)

  • 2218 Days Ago
  • 01/19/2006

Mis-spellings

Surely it should have been obvious that it was a human translation, since the spelling was so bad - "temperture", "uniquly" and "delitious", and equally clear that it is a translation by someone who is not a native English-speaker - i.e. not even a Google human translation, but rather home-grown at the Japanese Apple Farm. A correction to the correction is definitely in order as per Mike's comment.

Interesting article though! :)
