Where Siri Has Trouble Hearing, a Crowd of Humans Could Help

A program called Scribe harnesses humans on the Internet to generate speech captions in under five seconds.
March 18, 2013

Computer scientist Jeffrey Bigham has created a speech-recognition program that combines the best talents of machines and people.

Though voice recognition programs like Apple’s Siri and Nuance’s Dragon are quite good at hearing familiar voices and clearly dictated words, the technology still can’t reliably caption events that present new speakers, accents, phrases, and background noises. People are pretty good at understanding words in such situations, but most of us aren’t fast enough to transcribe the text in real time (that’s why professional stenographers can charge more than $100 an hour). So Bigham’s program Scribe augments fast computers with accurate humans in hopes of churning out captions and transcripts quickly.

This rapid-fire crowd-computing experiment could be a big help for deaf and hearing-impaired people. It could also suggest new ways to enhance voice recognition applications like Siri in areas where they struggle.

Scribe’s algorithms direct human workers to type out fragments of what they hear in a speech. By turning up the volume or slowing down the speed of slices of the audio, the program can direct different workers to unique but overlapping sections of a speech and then give them a few seconds to recover before asking them to type again.
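To make the scheduling idea concrete, here is a minimal sketch in Python of how overlapping assignments might be staggered across workers. The function name, window and rest lengths, and round-robin scheme are illustrative assumptions, not details of Scribe itself.

    def schedule_slices(audio_seconds, n_workers=4, window=4.0, rest=8.0):
        """Stagger short, overlapping audio windows across a pool of workers,
        leaving each worker a rest gap between consecutive windows."""
        # Starting a new window every `stride` seconds keeps every moment of
        # audio covered (stride < window), while any one worker sees a new
        # window only every window + rest seconds.
        stride = (window + rest) / n_workers
        assignments = {w: [] for w in range(n_workers)}
        t, worker = 0.0, 0
        while t < audio_seconds:
            assignments[worker].append((t, min(t + window, audio_seconds)))
            t += stride
            worker = (worker + 1) % n_workers
        return assignments

    # With the defaults, schedule_slices(30) gives each of four workers a
    # 4-second slice every 12 seconds, with adjacent slices overlapping by
    # 1 second so no speech falls between assignments.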

Using natural-language processing algorithms, Scribe strings together the typed-out fragments into a complete transcript, and the redundant overlaps can help it weed out errors. (This shotgun computing technique is similar to the way many DNA sequencing machines work, Bigham points out.) It can produce a transcript or caption with a delay as short as three seconds using just three to five workers.
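As a toy illustration of that stitching step (a sketch of the general overlap-merging technique, not Scribe's actual algorithm), two fragments can be joined by finding the longest run of words where the end of one matches the start of the next:

    def merge_fragments(left, right):
        """Join two word lists, collapsing the longest suffix of `left`
        that matches a prefix of `right`; the redundant overlap is what
        lets a merger catch individual workers' errors."""
        max_k = min(len(left), len(right))
        for k in range(max_k, 0, -1):
            if left[-k:] == right[:k]:
                return left + right[k:]
        return left + right  # no overlap found; just concatenate

    a = "people are pretty good at understanding".split()
    b = "good at understanding words in such situations".split()
    print(" ".join(merge_fragments(a, b)))
    # -> people are pretty good at understanding words in such situations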

Because the only requirement is that workers can hear and type, even as a group they cost less than a stenographer and don't need days of advance notice, he notes. That could be a big help for a deaf student who wants to, say, take a new online class that hasn't been captioned.

Bigham (see “Innovators Under 35, 2009: Jeffrey Bigham”) and his University of Rochester colleague Walter Lasecki have tested Scribe with laborers they found through Amazon’s Mechanical Turk, where people sign up to perform simple tasks. Those workers were paid a minimum of $6 an hour by Bigham’s team. The team also hired undergraduate work-study students for $10 an hour. The crowdsourced work of people in both groups appears to be only slightly less accurate than that of a professional stenographer, Bigham says. And in some cases, the pooled workers more accurately transcribed jargon terms that a single professional typist might mishear.

“What Scribe is starting to show is the ability to work together as part of a crowd to do very difficult performance tasks better than a person can do alone,” he says.

Bigham is now developing Scribe into an app that he hopes could help deaf people crowdsource transcripts quickly. To support a large number of users, he is also considering licensing the technology or spinning off a startup.

It’s not the first time someone has thought of using cheap, computer-coördinated human labor to bolster the traditional weaknesses in artificial intelligence programs or other software. Twitter is hiring people on Mechanical Turk to help its search engine classify newsy topics that suddenly start trending. Bigham also has created a crowdsourced personal-assistance system called Chorus (see “Artificial Intelligence, Powered By Many Humans”) that could be smarter than Siri but cheaper than any individual hourly employee.

This is not to say that human labor will always outperform automated systems at transcribing speech. Aditya Parameswaran, a researcher at Stanford University who also works on human-assisted computation methods, says that as learning algorithms improve, crowdsourcing techniques like these will be useful mostly for augmenting the computers’ accuracy, rather than for having humans do the bulk of the work.
