Something Lost in Skype Translation

Skype’s real-time translation software highlights remarkable progress in machine learning—but it still struggles with the subtleties of human communication.

John Pavlus

January 15, 2015

It sometimes seems as if the highest praise an innovative new technology can earn is a credulous comparison to Star Trek. The Oculus Rift is like the Holodeck; 3-D printers are like matter replicators; Qualcomm is even sponsoring an X-Prize contest to build a working tricorder.

And now Skype Translator, a real-time voice and text language translation app currently available to Windows 8.1 users as a public beta, is being widely compared to the “universal translator” that Captains Kirk and Picard used to effortlessly communicate with alien interlocutors. Skype Translator is less capable than that pat sci-fi analogy implies, but its limitations are as fascinating as its formidable technical achievements.

Skype Translator performs instant translation of text chats in over 40 languages, but its marquee feature is real-time, spoken translation between English and Spanish speakers. (Microsoft, which owns Skype, would not comment on what other languages it is planning to incorporate into the software or when we might expect them.)

Unlike Star Trek’s fictional translator, Skype Translator is designed to emulate a human interpreter who acts as an intermediary between the two primary speakers. This virtual interpreter is customizable: I could select a male or female voice and even set its tolerance for translating profanity (I didn’t put that feature to the test). Then, much as a human translator would, it “listened” to my speech, waited for a pause, and spoke my words in Spanish to the Microsoft consultant on the other end of the call. The spoken translation was audible to both of us. And it was often surprisingly accurate.

In theory, Skype translation could be transformative. It’s like a version of the discreet live translation that world leaders enjoy when visiting the United Nations. In practice, though, it can be more like having Apple’s Siri (or Microsoft’s Cortana) constantly interrupting your conversation and talking over you.

Even such crude automated translation is fairly remarkable. It is notoriously difficult for machines to recognize words and phrases quickly and accurately, and Skype Translator achieves a high level of accuracy using a technique known as deep learning. Software running on Microsoft’s servers was trained to recognize words using methods of information processing loosely modeled on the way a biological brain functions (see “10 Breakthrough Technologies 2013: Deep Learning”).

Deep learning lets Microsoft’s computers reliably transform a stream of audio speech into chunks of text, which can then be analyzed using standard translation methods. As more people use the software, this system should become more effective at recognizing idiosyncrasies of accent and cadence, potentially making Skype Translator—and Skype itself—more useful.

Microsoft’s software tries to filter out “disfluencies” (such as “um,” “ah,” and repetitions) on the word and sentence level. Some of these disfluencies made it through during my conversation, but the translation still occurred with impressive speed and accuracy.

The limitations of Skype’s translation software are also revealing, since they show how difficult it is for even the smartest machine to mimic the subtleties of effective human conversation. Determining which meaning of a word is appropriate in different contexts can be vexing. “If software is translating between American and British English, and it recognizes the word ‘football,’ it also needs to know when to change it to ‘soccer’ and when to keep it as ‘football’ or ‘gridiron,’” says Christopher Manning, a professor of linguistics and computer science in Stanford University’s Natural Language Processing Group.

Skype Translator is also deaf to the rhythms of normal spoken conversation, so you can’t be quite sure when its disembodied robot voice is going to break in and start blurting out its translated version. This is something we humans sometimes find challenging, too. “Even with human translators, you need to learn when to pause to let the interpreter absorb what you just said and repeat it,” says Vikram Dendi, strategy director at Microsoft Research.

With practice I could probably learn Skype Translator’s “rhythm” in the same way, which could make the audio experience less distracting. Introducing an on-screen avatar for the “bot” might also help reinforce the metaphor of a third person on the call, perhaps making it easier for the two human speakers to modulate their conversation in a way that makes room for the software speaking on their behalf.

But Skype Translator actually has a fairly elegant solution built in already: on-screen translated text of the spoken conversation, generated in real time. This interface is less overtly futuristic than spoken translation, but it feels more natural. And obvious mistakes are easy to correct, since either party can type into the chat window where the translations appear.

Dendi admits that Skype and Microsoft don’t yet know what an ideal user experience for the software looks like. “When we watch these things in action on TV [as on Star Trek], it seems so obvious: you just speak and it comes out translated,” he says. “But when you start digging into the actual implementation and put it in people’s hands to use, there are so many little details that can make or break the experience.”

Other efforts to harness deep learning could help. Researchers at Google and the University of Montreal are applying such methods to speech translation itself (as opposed to just speech recognition) “with stunning success,” according to Stanford’s Manning. Further advances could someday make real-time machine translation virtually perfect. Or progress could hit a wall. “The jury is still out,” Manning says. “I think it’s still unclear where the limits of deep learning are for solving higher-level cognitive processing problems.”

Skype Translator certainly hasn’t solved the problem just yet. But it’s a great start on breaking down some language barriers for now.
