Baidu’s Deep-Learning System Rivals People at Speech Recognition

China’s dominant Internet company, Baidu, is developing powerful speech recognition for its voice interfaces.
December 16, 2015

China’s leading Internet-search company, Baidu, has developed a voice system that can recognize English and Mandarin speech better than people, in some cases.

The new system, called Deep Speech 2, is especially significant because it relies entirely on machine learning for transcription. Whereas older voice-recognition systems include many handcrafted components to aid audio processing and transcription, the Baidu system learned to recognize words from scratch, simply by listening to thousands of hours of transcribed audio.

The technology relies on a powerful technique known as deep learning, which involves training a very large multilayered virtual network of neurons to recognize patterns in vast quantities of data. The Baidu app for smartphones lets users search by voice, and also includes a voice-controlled personal assistant called Duer (see “Baidu’s Duer Joins the Personal Assistant Party”). Voice queries are more popular in China because it is more time-consuming to input text, and because some people do not know how to use Pinyin, the phonetic system for transcribing Mandarin using Latin characters.

“Historically, people viewed Chinese and English as two vastly different languages, and so there was a need to design very different features,” says Andrew Ng, a former Stanford professor and Google researcher, and now chief scientist for the Chinese company. “The learning algorithms are now so general that you can just learn.”

Deep learning has its roots in ideas first developed more than 50 years ago, but in the past few years new mathematical techniques, combined with greater computer power and huge quantities of training data, have led to remarkable progress, especially in tasks that require some sort of visual or auditory perception. The technique has already improved the performance of voice recognition and image processing, and large companies including Google, Facebook, and Baidu are applying it to the massive data sets they own.

Deep learning is also being adopted for ever more tasks. Facebook, for example, uses deep learning to find faces in the images that its users upload, and more recently it has made progress in using deep learning to parse written text (see “Teaching Machines to Understand Us”). Google now uses deep learning in more than 100 different projects, from search to self-driving cars.

In 2013, Baidu launched its own effort to harness this new technology, the Deep Learning Institute, which is based both at the company’s Beijing headquarters and in Silicon Valley. Deep Speech 2 was primarily developed by a team in California.

In developing Deep Speech 2, Baidu also created a new hardware architecture for deep learning that runs seven times faster than the previous version. Deep learning usually relies on graphics processors, because these are good for the intensive parallel computations involved.

The speed achieved “allowed us to do experimentation on a much larger scale than people had achieved previously,” says Jesse Engel, a research scientist at Baidu and one of more than 30 researchers named on a paper describing Deep Speech 2. “We were able to search over a lot of [neural network] architectures, and reduce the word error rate by 40 percent.”

Ng adds that this has recently produced some impressive results. “For short phrases, out of context, we seem to be surpassing human levels of recognition,” he says.

He adds: “In Mandarin, there are a lot of regional dialects that are spoken by much smaller populations, so there’s much less data. This could help us recognize the dialects better.”
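The word error rate that Engel mentions is a standard yardstick for speech recognition: the number of word substitutions, insertions, and deletions needed to turn a system's transcript into the correct one, divided by the number of words in the correct transcript. The sketch below shows one common way to compute it. It is an illustration only, not Baidu's evaluation code; the function name `word_error_rate` and the sample phrases are assumptions made for this example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Hypothetical helper for illustration; not taken from Deep Speech 2.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into nothing costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Made-up phrases: one wrong word out of four reference words.
    print(word_error_rate("turn on the light", "turn off the light"))
```

A lower score is better, and a perfect transcript scores zero. Because insertions count as errors, the rate can exceed 1.0 when the system outputs many more words than were spoken.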
