Masters of Multimedia
Eric Chang is a sultan of speech. He talks fast, asks lots of questions, and seems to know what you’re going to say before you say it. It’s a bit unnerving at first, but given his graduate training in speech recognition at MIT, it makes sense. And since computer keyboards have trouble accommodating Asian languages (thousands of characters, in contrast to a few dozen letters), part of the motivation for Chang’s speech group in Beijing is to develop better interfaces for Asian users. Speech-based systems are part of Microsoft’s plan to enable legions of Chinese, for starters, to access information and communicate more effectively.
Chang walks into the office of a young researcher, Min Chu, and asks her to fire up the text-to-speech demo. Chu types in a sentence in Chinese, sprinkled with English words as is common in technical passages and discussions. After a few seconds, the computer reads the typed sentence aloud over the desktop speakers in a natural-sounding female voice that sounds perfectly bilingual.
The trick is to get the inflections, timing, and transitions from word to word to sound just right, and not like a robotic monotone. Unlike other speech synthesizers, Chang and Chu’s software breaks text into different-size chunks (phonemes, syllables, or whole words) and uses a database of more than 10,000 spoken sentences to select and piece together the right sounds. This bilingual synthesizer is “really head and shoulders above anything I’ve heard,” says MIT’s Zue, an expert on spoken-language systems.
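To make the idea of different-size chunks concrete, here is a minimal sketch in Python. It is not the lab’s method: it assumes a tiny, hypothetical inventory of recorded chunks and simply prefers the longest recorded chunk at each position, on the theory that longer chunks need fewer audible joins. The function name, the inventory, and the pinyin syllables are all invented for illustration, and a real synthesizer would also weigh pitch, timing, and how well neighboring chunks fit together.

```python
def select_units(syllables, inventory, max_len=4):
    """Greedily cover `syllables` with the longest chunks found in `inventory`."""
    chosen = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate chunk first, then shrink.
        for size in range(min(max_len, len(syllables) - i), 0, -1):
            chunk = tuple(syllables[i:i + size])
            # A single syllable is always accepted as a last resort,
            # even if it is missing from the inventory.
            if chunk in inventory or size == 1:
                chosen.append(chunk)
                i += size
                break
    return chosen


# Hypothetical inventory of chunks available in the recorded database.
inventory = {
    ("ni", "hao"),
    ("shi", "jie"),
    ("ni",), ("hao",), ("shi",), ("jie",), ("da",),
}

units = select_units(["ni", "hao", "da", "shi", "jie"], inventory)
print(units)
```

In this toy run the two-syllable recordings are used whole, and only the syllable with no longer match is taken on its own.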
The synthesizer is an example of how the lab’s cultural perspective has been instrumental in solving problems. The first goal of the project was to create a Mandarin speech synthesizer for the Chinese market. “In 2001, we had our first ‘Bill G.’ review,” says Chang. “He said, ‘That’s good, but I don’t understand Chinese.’” That reaction from Microsoft’s chairman motivated Chang’s group to apply the same mathematical models to English. Because pitch matters so much in Mandarin (a subtle change of tone is all that distinguishes the word for “mother” from the word for “horse”), the system was better able to capture the inflections of English and other languages as well. Expect to see this voice synthesis software on the market in the next few years, says Chang, who recently became assistant managing director of the lab’s Advanced Technology Center.
The Beijing lab is also helping Microsoft understand the Asian marketplace in more immediate consumer areas, such as multimedia communications over mobile devices. Already, there are more than 240 million cell-phone users in China alone. They tend to update their services more often than U.S. users and are more interested in gadgets generally, says Shipeng Li, head of the lab’s Internet media group and another former Sarnoff researcher. “Here it’s like fashion,” he says.
The stylishly casual Li wears jeans and comes across as more laid-back than other researchers. His group is all about smooth: smooth video, that is. In the next room, one of Li’s 20 students has set up a demo of one of the world’s first videoconferencing systems that runs on a handheld computer. The student picks up the handheld (which houses a video camera, microphone, wireless link, and data communication software) and speaks into it. His face shows up on the screen of a nearby desktop computer, which is similarly equipped. The video is encoded at 10 frames per second, enough to look fairly smooth, with an audio delay of about half a second as the researchers talk back and forth. Although the quality is lower than that of normal video, says Li, it’s still far higher than that of existing handheld technologies.
The key advance: software running on each user’s computer monitors data channel conditions, takes into account what kinds of devices are being used, and efficiently compresses the video stream so that fewer bits need to be sent. Some 50,000 users have downloaded the latest prototype version of the software from Microsoft’s website. If transmission delays can be reduced, Li says, handheld videophones should take off in the Asian market within three years.
But there are nearer-term applications, too. Take Web downloads of multimedia files. Researchers in Li’s group are developing ways to code video so it can be sent to your desktop without the pauses, skips, and hang-ups that are all too common with today’s Internet links. Li’s system does this by adapting to the conditions of the data connection.
Li employs a simple analogy to explain Microsoft’s advance. Imagine media content as “freight to be transported,” he says. Instead of today’s strategy of sending it in one big truck, which can get stuck in a traffic jam, Li’s team sends it in pieces in smaller vehicles, giving higher priority to those bits identified ahead of time as being especially important. Even if some pieces get stuck or lost, on average the most important ones (those that describe the basic picture structure and how it’s changing) get through.
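The freight analogy can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the lab’s codec: the `Packet` class, the `schedule` function, the packet names, sizes, and priority numbers are all hypothetical. The sketch assumes each piece of video has been tagged ahead of time with a priority, and that the link can carry only a fixed number of bytes in a given interval.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    name: str
    size: int       # bytes
    priority: int   # lower number = more important


def schedule(packets, budget):
    """Return the packets to send within `budget` bytes, most important first."""
    sent = []
    remaining = budget
    # sorted() is stable, so packets of equal priority keep their original order.
    for p in sorted(packets, key=lambda p: p.priority):
        if p.size <= remaining:
            sent.append(p)
            remaining -= p.size
    return sent


# Hypothetical pieces of one stretch of video.
packets = [
    Packet("detail-1", 600, 2),
    Packet("structure", 400, 0),
    Packet("motion", 300, 1),
    Packet("detail-2", 500, 2),
]

sent = schedule(packets, budget=1000)
print([p.name for p in sent])
```

With room for only 1,000 bytes, the pieces describing the picture’s structure and motion go out and the fine detail is dropped, so the viewer would see a coarser picture rather than a stalled one.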
The end result is smoother, more reliable video downloads. Using the technology, Li plays a video of singer Christina Aguilera; right next to it, he plays the same video on Microsoft’s current media player. The new version is less jerky and doesn’t skip. Indeed, says Li, the next release of Microsoft’s media player will incorporate this smooth scheme, courtesy of the Beijing lab.