Baidu’s Deep-Learning System Rivals People at Speech Recognition

China’s dominant Internet company, Baidu, is developing powerful speech recognition for its voice interfaces.
December 16, 2015

China’s leading Internet-search company, Baidu, has developed a voice system that can recognize English and Mandarin speech better than people, in some cases.

The new system, called Deep Speech 2, is especially significant because it relies entirely on machine learning for transcription. Whereas older voice-recognition systems include many handcrafted components to aid audio processing and transcription, the Baidu system learned to recognize words from scratch, simply by listening to thousands of hours of transcribed audio.

The technology relies on a powerful technique known as deep learning, which involves training a very large multilayered virtual network of neurons to recognize patterns in vast quantities of data. The Baidu app for smartphones lets users search by voice, and also includes a voice-controlled personal assistant called Duer (see “Baidu’s Duer Joins the Personal Assistant Party”). Voice queries are especially popular in China because entering text is more time-consuming there, and because some people do not know how to use Pinyin, the phonetic system for transcribing Mandarin using Latin characters.

“Historically, people viewed Chinese and English as two vastly different languages, and so there was a need to design very different features,” says Andrew Ng, a former Stanford professor and Google researcher, and now chief scientist for the Chinese company. “The learning algorithms are now so general that you can just learn.”

Deep learning has its roots in ideas first developed more than 50 years ago, but in the past few years new mathematical techniques, combined with greater computer power and huge quantities of training data, have led to remarkable progress, especially in tasks that require some sort of visual or auditory perception. The technique has already improved the performance of voice recognition and image processing, and large companies including Google, Facebook, and Baidu are applying it to the massive data sets they own.

Deep learning is also being adopted for ever more tasks. Facebook, for example, uses deep learning to find faces in the images that its users upload, and more recently it has made progress in using deep learning to parse written text (see “Teaching Machines to Understand Us”). Google now uses deep learning in more than 100 different projects, from search to self-driving cars.

In 2013, Baidu launched its own effort to harness this new technology, the Deep Learning Institute, which is based at the company’s Beijing headquarters and in Silicon Valley. Deep Speech 2 was primarily developed by a team in California.

In developing Deep Speech 2, Baidu also created a new hardware architecture for deep learning that runs seven times faster than the previous version. Deep learning usually relies on graphics processors, because these are good for the intensive parallel computations involved.

The speed achieved “allowed us to do experimentation on a much larger scale than people had achieved previously,” says Jesse Engel, a research scientist at Baidu and one of more than 30 researchers named on a paper describing Deep Speech 2. “We were able to search over a lot of [neural network] architectures, and reduce the word error rate by 40 percent.”
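For readers unfamiliar with the metric Engel mentions, word error rate is conventionally computed as the minimum number of word substitutions, insertions, and deletions needed to turn a system's transcript into the reference transcript, divided by the number of words in the reference. The sketch below illustrates that standard definition only; it is not Baidu's evaluation code, and the function names `word_error_rate` and `relative_reduction` are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative only: real evaluations usually normalize case and
    punctuation first, which this sketch does not do.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(old_rate: float, new_rate: float) -> float:
    """Fractional improvement of new_rate over old_rate."""
    return (old_rate - new_rate) / old_rate
```

The article does not say whether the 40 percent figure is relative or absolute, and it gives no before-and-after rates. Read as a relative reduction, a purely hypothetical drop from 10 errors per 100 words to 6 per 100 would count as a 40 percent improvement.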

Ng adds that this has recently produced some impressive results. “For short phrases, out of context, we seem to be surpassing human levels of recognition,” he says.

He adds: “In Mandarin, there are a lot of regional dialects that are spoken by much smaller populations, so there’s much less data. This could help us recognize the dialects better.”
