Where Siri Has Trouble Hearing, a Crowd of Humans Could Help

A program called Scribe harnesses humans on the Internet to generate speech captions in under five seconds.

Jessica Leber

March 18, 2013

Computer scientist Jeffrey Bigham has created a speech-recognition program that combines the best talents of machines and people.

Though voice recognition programs like Apple’s Siri and Nuance’s Dragon are quite good at hearing familiar voices and clearly dictated words, the technology still can’t reliably caption events that present new speakers, accents, phrases, and background noises. People are pretty good at understanding words in such situations, but most of us aren’t fast enough to transcribe the text in real time (that’s why professional stenographers can charge more than $100 an hour). So Bigham’s program Scribe augments fast computers with accurate humans in hopes of churning out captions and transcripts quickly.

This rapid-fire crowd-computing experiment could be a big help for deaf and hearing-impaired people. It could also provide new ways to enhance voice-recognition applications like Siri in areas where they struggle.

Scribe’s algorithms direct human workers to type out fragments of what they hear in a speech. By turning up the volume or slowing down the speed of slices of the audio, the program can direct different workers to unique but overlapping sections of a speech and then give them a few seconds to recover before asking them to type again.
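To make the idea of "unique but overlapping sections" concrete, here is a minimal sketch of how audio might be divided among workers. It is an illustration only, not Scribe's actual scheduler: the function name `assign_slices`, the round-robin rotation, and the default slice and overlap lengths are all assumptions.

```python
from typing import List, Tuple


def assign_slices(duration: float, n_workers: int,
                  slice_len: float = 4.0,
                  overlap: float = 1.0) -> List[Tuple[int, float, float]]:
    """Split `duration` seconds of audio into overlapping slices.

    Returns (worker_id, start, end) tuples. Workers are assigned in
    round-robin order, so each one rests while the others type.
    All parameters are hypothetical; Scribe's real values are not
    given in the article.
    """
    if duration <= 0 or n_workers < 1:
        raise ValueError("duration and n_workers must be positive")
    if not 0 <= overlap < slice_len:
        raise ValueError("overlap must be non-negative and shorter than slice_len")

    step = slice_len - overlap  # how far each new slice advances
    schedule: List[Tuple[int, float, float]] = []
    start, index = 0.0, 0
    while True:
        end = min(start + slice_len, duration)
        schedule.append((index % n_workers, start, end))
        if end >= duration:  # the audio is fully covered
            break
        start += step
        index += 1
    return schedule


if __name__ == "__main__":
    for worker, start, end in assign_slices(10.0, n_workers=3):
        print(f"worker {worker}: {start:.1f}s to {end:.1f}s")
```

With three workers and ten seconds of audio, each slice shares one second with its neighbor, so every boundary is heard by two people.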

Using natural-language processing algorithms, Scribe strings together the typed-out fragments into a complete transcript, and the redundant overlaps can help it weed out errors. (This shotgun computing technique is similar to the way many DNA sequencing machines work, Bigham points out.) It can produce a transcript or caption with a delay as short as three seconds using just three to five workers.
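The stitching step can be pictured with a toy example. The sketch below joins time-ordered fragments by finding the longest run of words that ends one fragment and begins the next, in the spirit of the shotgun-sequencing comparison. It is a simplified stand-in, not Scribe's method: the function name `merge_fragments` is hypothetical, it assumes the fragments arrive in order and agree exactly where they overlap, and it does not attempt the error correction the article describes.

```python
from typing import List


def merge_fragments(fragments: List[str]) -> str:
    """Join time-ordered text fragments, collapsing repeated overlaps.

    For each new fragment, find the longest suffix of the transcript
    so far that matches a prefix of the fragment, then append only
    the words that follow it.
    """
    words: List[str] = []
    for fragment in fragments:
        new = fragment.lower().split()
        max_k = min(len(words), len(new))
        # Try the longest possible overlap first; 0 means no overlap.
        k = next(
            (k for k in range(max_k, 0, -1) if words[-k:] == new[:k]),
            0,
        )
        words.extend(new[k:])
    return " ".join(words)


if __name__ == "__main__":
    typed = [
        "the quick brown fox",
        "brown fox jumps over",
        "jumps over the lazy dog",
    ]
    print(merge_fragments(typed))
```

A real system would also have to cope with typos and disagreements between workers, for instance by aligning the overlapping words and keeping the version most workers typed.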

The only requirement is that the workers can hear and type, so even as a group, they cost less than a stenographer and don’t need days of advance notice, he notes. That could be a big help for a deaf student who wants to, say, take a new online class that hasn’t been captioned.

Bigham (see “Innovators Under 35, 2009: Jeffrey Bigham”) and his University of Rochester colleague Walter Lasecki have tested Scribe with laborers they found through Amazon’s Mechanical Turk, where people sign up to perform simple tasks. Those workers were paid a minimum of $6 an hour by Bigham’s team. The team also hired undergraduate work-study students for $10 an hour. The crowdsourced work of people in both groups appears to be only slightly less accurate than that of a professional stenographer, Bigham says. And in some cases, the pooled workers more accurately transcribed jargon terms that a single professional typist might mishear.

“What Scribe is starting to show is the ability to work together as part of a crowd to do very difficult performance tasks better than a person can do alone,” he says.

Bigham is now developing Scribe into an app that he hopes could help deaf people crowdsource transcripts quickly. To support a large number of users, he is also considering licensing the technology or spinning off a startup.

It’s not the first time someone has thought of using cheap, computer-coordinated human labor to bolster the traditional weaknesses in artificial intelligence programs or other software. Twitter is hiring people on Mechanical Turk to help its search engine classify newsy topics that suddenly start trending. Bigham also has created a crowdsourced personal-assistance system called Chorus (see “Artificial Intelligence, Powered By Many Humans”) that could be smarter than Siri but cheaper than any individual hourly employee.

This is not to say that human labor will always outperform automated systems at transcribing speech. Aditya Parameswaran, a researcher at Stanford University who also works on human-assisted computation methods, says that as learning algorithms improve, crowdsourcing techniques like these will be useful mostly for augmenting the computers’ accuracy, rather than for having humans do the bulk of the work.
