
Baidu’s Deep-Learning System Rivals People at Speech Recognition

China’s dominant Internet company, Baidu, is developing powerful speech recognition for its voice interfaces.
December 16, 2015

China’s leading Internet-search company, Baidu, has developed a voice system that can recognize English and Mandarin speech better than people, in some cases.

The new system, called Deep Speech 2, is especially significant in that it relies entirely on machine learning for transcription. Whereas older voice-recognition systems include many handcrafted components to aid audio processing and transcription, the Baidu system learned to recognize words from scratch, simply by listening to thousands of hours of transcribed audio.

The technology relies on a powerful technique known as deep learning, which involves training a very large multilayered virtual network of neurons to recognize patterns in vast quantities of data. The Baidu app for smartphones lets users search by voice, and also includes a voice-controlled personal assistant called Duer (see “Baidu’s Duer Joins the Personal Assistant Party”). Voice queries are more popular in China because entering text is more time-consuming there, and because some people do not know how to use Pinyin, the phonetic system for transcribing Mandarin using Latin characters.

“Historically, people viewed Chinese and English as two vastly different languages, and so there was a need to design very different features,” says Andrew Ng, a former Stanford professor and Google researcher, and now chief scientist for the Chinese company. “The learning algorithms are now so general that you can just learn.”

Deep learning has its roots in ideas first developed more than 50 years ago, but in the past few years new mathematical techniques, combined with greater computer power and huge quantities of training data, have led to remarkable progress, especially in tasks that require some sort of visual or auditory perception. The technique has already improved the performance of voice recognition and image processing, and large companies including Google, Facebook, and Baidu are applying it to the massive data sets they own.

Deep learning is also being adopted for an ever-growing range of tasks. Facebook, for example, uses deep learning to find faces in the images that its users upload, and more recently it has made progress in using the technique to parse written text (see “Teaching Machines to Understand Us”). Google now uses deep learning in more than 100 different projects, from search to self-driving cars.

In 2013, Baidu launched its own effort to harness this new technology, the Deep Learning Institute, which is based both at the company’s Beijing headquarters and in Silicon Valley. Deep Speech 2 was primarily developed by a team in California.

In developing Deep Speech 2, Baidu also created a new hardware architecture for deep learning that runs seven times faster than the previous version. Deep learning usually relies on graphics processors, because these are well suited to the intensive parallel computations involved.

The speed achieved “allowed us to do experimentation on a much larger scale than people had achieved previously,” says Jesse Engel, a research scientist at Baidu and one of more than 30 researchers named on a paper describing Deep Speech 2. “We were able to search over a lot of [neural network] architectures, and reduce the word error rate by 40 percent.”
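The article does not define word error rate. As it is conventionally computed, it is the minimum number of word substitutions, insertions, and deletions needed to turn a system's transcript into the correct one, divided by the number of words in the correct transcript. The short Python sketch below illustrates that standard definition; the function name and the sample sentences are assumptions made for illustration, and nothing here is taken from Baidu's own evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative helper only (hypothetical name, not from Deep Speech 2).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word dropped (deletion)
                curr[j - 1] + 1,     # extra hypothesis word (insertion)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One wrong word out of six in the reference transcript.
rate = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
```

By this measure a transcript identical to the reference scores 0, and the rate can exceed 1 when the system outputs many spurious words.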

Ng adds that this has recently produced some impressive results. “For short phrases, out of context, we seem to be surpassing human levels of recognition,” he says.

He adds: “In Mandarin, there are a lot of regional dialects that are spoken by much smaller populations, so there’s much less data. This could help us recognize the dialects better.”
