For Computers, Too, It’s Hard to Learn to Speak Chinese

Challenging written characters make voice-based computing a natural for China, but computers that can hold a conversation in Chinese are some way off.

Yiting Sun

July 25, 2017

Researchers often call 2017 the year of the conversational computer in China. Leveraging recent advances in voice recognition and natural-language processing, e-commerce giant Alibaba and search giant Baidu have both been developing technology to crack voice-based communication (see 10 Breakthrough Technologies: Conversational Interfaces). Now voice-operated products derived from Baidu’s and Alibaba’s technology are coming to the Chinese market.

The Tmall Genie, which has Alibaba’s voice assistant, AliGenie, built in, is akin to the Amazon Echo. It can place online orders, check the weather, play your favorite music, and control other smart devices in your home through voice commands.

Baidu’s DuerOS conversational platform has been added as a feature in such products as a home-assistant robot, a television set-top box, and an HTC smartphone. It has functions similar to those of AliGenie and other voice assistants, as well as a rudimentary ability to carry on casual chat. The company says it has received a large number of orders for its DuerOS development kit.

Kun Jing, general manager of Baidu’s Duer business unit, expects many more companies to enter the field this year, motivated partly by the success of products like the Echo in the U.S. market, which has piqued the interest of Chinese tech investors.

Research firm IDC predicts that by 2020, 51 percent of the smart-driving industry and 68 percent of the cell phone and wearables industry in China will have a conversation-based AI system baked in. Just as the touch screen made interacting with a mobile device so much easier, conversational interfaces will make interaction more natural and draw more people into the connected world, says Jing, who oversees the development of DuerOS.

Voice-based computing is a good option for China. Today typing Chinese on a typical QWERTY keyboard relies on a system called “pinyin,” based on characters’ pronunciation. But because Mandarin has four tones and many characters share the same syllable, the user must painstakingly select the right character from a drop-down menu after typing the pronunciation. A common syllable like “yi” can correspond to 60 or more frequently used Chinese characters. Some input methods can prioritize the most likely character according to the context, but they are not always accurate. Unsurprisingly, users of mobile technologies like the popular WeChat communication app tend to leave voice messages for one another, rather than the typed texts typical in the U.S.
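To make the character-selection problem concrete, here is a minimal Python sketch of how an input method might list candidates for one syllable and promote a likely one based on the preceding character. The candidate table, the context hints, and the function name `rank_candidates` are illustrative assumptions, not the workings of any real input method.

```python
# Toy pinyin lookup. Every character below is pronounced "yi" (in various tones).
# This short list is an illustrative assumption, not a real dictionary.
CANDIDATES = {
    "yi": ["一", "以", "已", "意", "医", "衣", "易", "亿"],
}

# Hypothetical context hints: (syllable, preceding character) -> likely character.
CONTEXT_HINTS = {
    ("yi", "中"): "医",  # 中医, traditional Chinese medicine
    ("yi", "容"): "易",  # 容易, easy
    ("yi", "毛"): "衣",  # 毛衣, sweater
}


def rank_candidates(syllable, previous_char=None):
    """Return the candidate characters, moving a context match to the front."""
    candidates = list(CANDIDATES.get(syllable, []))
    hint = CONTEXT_HINTS.get((syllable, previous_char))
    if hint in candidates:
        candidates.remove(hint)
        candidates.insert(0, hint)
    return candidates


if __name__ == "__main__":
    print(rank_candidates("yi"))        # no context: default order
    print(rank_candidates("yi", "中"))  # after 中, 医 is listed first
```

Without context the user still has to scan the whole list; with a hint, the likely character moves to the top, though a wrong hint would promote the wrong character.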

In China today, voice assistant technology works by turning a user’s voice commands into text and generating a response based on the meaning of the text. That process works pretty well for task-based commands—check the weather or look for the English translation of a particular Chinese word—but it cannot sustain a back-and-forth conversation about multiple subjects.
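The two-step flow described above, speech to text and then text to a response, can be sketched in a few lines of Python. This is a rough illustration under stated assumptions: `recognize_speech` is a stand-in that treats its input as an already transcribed string, and the keyword table and canned replies are hypothetical, not how AliGenie or DuerOS work.

```python
import re

# Hypothetical keyword table mapping an intent name to trigger words.
INTENT_KEYWORDS = {
    "weather": {"weather", "rain", "temperature"},
    "translate": {"translate", "translation"},
}

# Hypothetical canned replies, one per intent.
REPLIES = {
    "weather": "Looking up the weather.",
    "translate": "Looking up the translation.",
    "unknown": "Sorry, I did not understand that.",
}


def recognize_speech(audio):
    """Stand-in for a speech recognizer: the 'audio' here is already a transcript."""
    return audio.strip().lower()


def detect_intent(text):
    """Pick the first intent whose keywords appear among the words of the text."""
    words = set(re.findall(r"[a-z']+", text))
    for intent, keywords in INTENT_KEYWORDS.items():
        if words & keywords:
            return intent
    return "unknown"


def respond(audio):
    """Step 1: turn speech into text. Step 2: answer based on the text's meaning."""
    text = recognize_speech(audio)
    return REPLIES[detect_intent(text)]


if __name__ == "__main__":
    print(respond("What's the weather today?"))
    print(respond("Tell me about your day"))
```

Each request is handled on its own, with no memory of earlier turns, which is one reason a design like this suits single commands better than a running conversation.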

Solving conversational computing will require overcoming some of the challenging complexities of the Chinese language. In Chinese, for example, the same characters arranged in different order mean different things, and even when arranged in the same order, they can have different meanings depending on what comes before or after them. In addition, written Chinese does not have spaces naturally dividing words as English does. So Chinese natural-language-processing researchers must teach their algorithms where to insert spaces in order to establish the proper meaning of a particular combination of characters. The absence of Chinese verb tenses—there are no distinctive forms for past, present, or future—also makes it challenging for machines to decipher the timeline of a sequence.
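The word-boundary problem can be illustrated with a classic dictionary-based baseline called maximum matching, sketched below in Python. It is a textbook technique chosen for illustration, not the method any company named here is known to use, and the four-word dictionary is an assumption. Scanning the same string from the left and from the right yields two different segmentations.

```python
# Tiny illustrative dictionary (an assumption for this sketch).
# 研究 = research, 研究生 = graduate student, 生命 = life, 起源 = origin
DICTIONARY = {"研究", "研究生", "生命", "起源"}
MAX_WORD_LEN = max(len(word) for word in DICTIONARY)


def forward_max_match(text):
    """Scan left to right, always taking the longest dictionary word available."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is accepted as a fallback when nothing matches.
            if size == 1 or piece in DICTIONARY:
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text):
    """Scan right to left, always taking the longest dictionary word available."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(MAX_WORD_LEN, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in DICTIONARY:
                words.insert(0, piece)
                j -= size
                break
    return words


if __name__ == "__main__":
    sentence = "研究生命起源"  # "study the origin of life"
    print(" ".join(forward_max_match(sentence)))   # 研究生 命 起源
    print(" ".join(backward_max_match(sentence)))  # 研究 生命 起源
```

The forward scan grabs 研究生 ("graduate student") and strands 命, while the backward scan recovers the intended reading. Neither direction is reliably right, which is why segmentation has to take context into account.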

Chinese natural-language-processing researchers are tackling other challenges, too: Numerous dialects exist, some of which are mutually incomprehensible, and the same expression can mean different things in different contexts.

Zhiyong Wu, an associate professor at Tsinghua University who studies natural-language understanding, notes that for computers to truly understand the intent of a human speaker and communicate appropriately, they will have to pick up subtle clues such as intonation and stress. They will also have to understand emotions, since humans’ decision making is not based solely on logic, notes Jia Jia, an associate professor at Tsinghua University who studies social affective computing.

To make its system smarter, Baidu introduced a “trainer” mode in its platform this year to allow software developers to contribute language data in real time through a built-in annotator bot. The bot receives developer feedback (such as the explanation of a query the system didn’t understand the first time), learns from that, and then corrects the system.

One advantage Chinese researchers have as they try to solve these problems is a large quantity of data. The neural networks that underpin the language understanding of today’s computers require large amounts of data to train. The more data a company has, the smarter its neural networks will become, and companies like Baidu and Alibaba have the benefit of vast user bases. As of the end of 2016, Baidu claimed 665 million monthly active mobile users, and as of March this year, Alibaba had 507 million.

But Gang Wang, a scientist at Alibaba’s A.I. Lab, says that to become more efficient at learning language, researchers will have to design neural networks that don’t need a lot of data. In the real world, people express the same meaning in different ways, and it’s impossible to teach the computer every possible expression, he notes. In his previous role as an academic researcher, he and his colleagues came up with a method for teaching computers to understand a subject when very little data is available: use data from related subjects. For example, to train a neural network to understand texts in sports medicine, you could draw upon data from sports and data from medicine. The approach is not as good as using data from the target subject itself, Wang notes, but when that’s lacking, it does make it possible to train neural networks on a topic.
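The intuition behind borrowing from related subjects can be shown with a deliberately simple Python sketch. It does not train a neural network and is not Wang's method. It only measures how many words of a sports-medicine sentence are already seen in sports text, in medical text, and in the two pooled together. The sentences and the helper names `vocabulary` and `coverage` are assumptions made up for illustration.

```python
# Made-up example sentences standing in for two related-domain corpora.
SPORTS = [
    "the runner injured her knee during the race",
    "training improves endurance",
]
MEDICINE = [
    "the doctor examined the ligament",
    "physical therapy reduces inflammation",
]

# A made-up sentence from the target domain, sports medicine.
TARGET = "therapy for a runner with an injured knee ligament"


def vocabulary(sentences):
    """Collect the set of distinct words that appear in the sentences."""
    return {word for sentence in sentences for word in sentence.split()}


def coverage(text, vocab):
    """Return the fraction of the words in text that are found in vocab."""
    words = text.split()
    return sum(word in vocab for word in words) / len(words)


if __name__ == "__main__":
    print("sports only:  ", round(coverage(TARGET, vocabulary(SPORTS)), 2))
    print("medicine only:", round(coverage(TARGET, vocabulary(MEDICINE)), 2))
    print("pooled:       ", round(coverage(TARGET, vocabulary(SPORTS + MEDICINE)), 2))
```

In this toy case, the pooled data covers five of the nine target words, compared with three for sports alone and two for medicine alone. Word overlap is only a crude proxy for what a trained model would learn, but it shows why related data can help when data on the topic itself is scarce.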

Ultimately, what will make a voice assistant succeed in China is its content and services, says Chenfeng Song, founder of Ainemo, a startup that makes a voice-activated home assistant robot called Little Fish that went on sale in June. Song plans to gradually build educational and health-care programs into his company’s home assistant. Little Fish uses the DuerOS conversational platform. Voice, Song notes, is a way to deliver content to people who cannot access the Internet very well through desktop computers and smartphones, especially the elderly and young children.
