You probably use voice recognition technology already, if in a limited capacity. Maybe you use Google’s voice-activated search, or take advantage of its (somewhat wonky) voice-mail transcriptions in Google Voice. At the office, maybe you use Dragon dictation software. Even if these programs worked perfectly (which they don’t), they would still leave something to be desired. Voice recognition software today works in very specialized circumstances: it can typically recognize only one voice at a time, and it performs best when it has reams of data in the archive before tackling a new speech sample.
What if we had voice recognition technology that didn’t have so many strictures? What if we had software that was quick and nimble, able to discern one speaker from another on the fly? In other words, what if voice recognition technology were more like the way voice recognition actually works in the real world, in the human brain?
A coalition of three British universities (the Universities of Cambridge, Sheffield, and Edinburgh) is working to bring us what they call “natural speech technology.” Google and Dragon are (relatively) good at what they do, Thomas Hain of Sheffield recently told The Engineer. “But where it’s about natural speech, people having a normal conversation, these applications still have very poor performance.”
With nearly $10 million of funding from Britain’s Engineering and Physical Sciences Research Council, the team has set itself four main technical objectives.
First, they want to make speech software that’s smart: software that can learn and adapt on the fly. They intend to build models and algorithms that can “adapt to new scenarios and speaking styles, and seamlessly adapt to new situations and contexts almost instantaneously,” the team members write.
Second, they want those models and algorithms to be smart enough to eavesdrop on a meeting, and to be able to sift “who spoke what, when, and how.” In other words, they want speech software as adept as a great human stenographer. Then, looking forward, the team’s third and fourth goals are to create technologies building on their models: speech synthesizers (for sufferers of stroke or neurodegenerative diseases) that learn from data and that are “capable of generating the full expressive diversity of natural speech”; and various other applications. These are as yet vaguely defined, but might include something the team calls “personal listeners.”
It’s very ambitious stuff, enough to make you pause and consider a future in which speech recognition is ubiquitous, seamless, and orders of magnitude more useful than it is today. Some of the researchers are already at work on applications; Hain’s award-winning team is collaborating with the BBC to transcribe its back catalog of audio and video footage.