How Shining a Laser on Your Face Might Help Siri Understand You

Startup VocalZoom is building a sensor that measures the vibrations of your face to make it easier for you to control technology with your voice.

Rachel Metz

June 6, 2016

From Siri to Alexa to Cortana, we’re talking to virtual assistants more than ever before. They can still have trouble understanding simple commands to play music or look up directions, though, especially in noisy places.

Rather than focusing on cleaning up the audio signal that captures your voice, Israeli startup VocalZoom thinks it might be possible to make all kinds of speech-recognition applications work a lot better by using a tiny, low-power laser that measures the itty-bitty vibrations of your skin when you speak.

The company, which has raised about $12.5 million in venture funding thus far, is building a sensor with a small laser that it says will initially be built into headsets and helmets. There, it will be used alongside existing microphone-based speech-recognition technologies to reduce overall misunderstandings.

VocalZoom founder and CEO Tal Bakish thinks it will first be used for things like motorcycle helmets or headsets worn by warehouse workers—you might use it to ask for directions while riding your Harley, for instance. A Chinese speech-recognition company called iFlytek plans to have a prototype headset ready at the end of August. Bakish also expects it to be added to cars by 2018 for giving voice commands when you’re behind the wheel. The company has joint-development agreements with several automotive companies, though he won’t name them on the record, and he’s interested in getting the technology into smartphones, too.

At a noisy coffee shop in Boston, Bakish shows me a nonworking version of VocalZoom’s first product, slated to be ready this summer: a tiny sensor with a laser that shines directly at your face (he says it’s eye-safe according to U.S. Food and Drug Administration rules). If you were using one of these sensors in a headset to ask for directions to a restaurant, for instance, it would measure the velocity of your facial skin vibrations, while a regular audio signal would be captured by a microphone; software would then compare these two signals to come up with the best approximation of what you’re trying to say.
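VocalZoom has not published how its software combines the two signals, so the following is only an illustrative sketch of one simple way a skin-vibration reading could help clean up a microphone recording. It assumes both signals are sampled at the same rate and aligned in time, and it mutes any stretch of microphone audio during which the skin is barely vibrating (meaning the wearer is probably not talking). The function names, frame length, and threshold are all hypothetical.

```python
def frame_energy(samples, frame_len):
    """Mean squared amplitude of each consecutive, non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def gate_microphone(mic, skin_velocity, frame_len=4, threshold=0.01):
    """Silence microphone frames in which the skin-vibration signal is quiet.

    Hypothetical illustration only: this is not VocalZoom's algorithm.
    Both inputs are assumed to be equal-length, time-aligned sample lists.
    """
    if len(mic) != len(skin_velocity):
        raise ValueError("signals must have the same number of samples")
    gated = list(mic)
    for k, energy in enumerate(frame_energy(skin_velocity, frame_len)):
        if energy < threshold:
            start = k * frame_len
            gated[start:start + frame_len] = [0.0] * frame_len
    return gated


# Made-up samples: the first four are background noise picked up by the
# microphone while the skin is still; the last four are actual speech.
mic = [0.3, -0.2, 0.25, -0.3, 0.5, -0.6, 0.55, -0.5]
skin = [0.0, 0.01, -0.01, 0.0, 0.4, -0.5, 0.45, -0.4]
gated = gate_microphone(mic, skin)
print(gated)
```

In this toy example the noisy first half of the microphone signal is zeroed out and the spoken second half passes through unchanged. A real system would presumably do something far more sophisticated than hard muting.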

Bakish says VocalZoom’s sensor can measure vibrations of the skin from your eyes down to your throat and neck, and that it’s also possible to do so from behind, such as by analyzing vibrations behind your ears. The laser can work up to a meter away, though a five-centimeter distance is sufficient in, say, a headset.

Bakish says that when used alongside more standard audio-analyzing speech-recognition technology, VocalZoom has been able to cut speech-recognition error rates by 60 to 80 percent.
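To put that range in concrete terms, here is a small worked calculation. It assumes the figures are relative reductions, which the company's claim does not spell out, and it uses a made-up baseline error rate of 20 percent purely for illustration.

```python
def reduced_error_rate(baseline, relative_cut):
    """Error rate left over after a relative reduction (0.6 means 60 percent)."""
    if not 0.0 <= relative_cut <= 1.0:
        raise ValueError("relative_cut must be between 0 and 1")
    return baseline * (1.0 - relative_cut)


baseline = 0.20  # hypothetical: 1 in 5 words misrecognized in a noisy place
after_60 = reduced_error_rate(baseline, 0.60)  # about 0.08, or 8 percent
after_80 = reduced_error_rate(baseline, 0.80)  # about 0.04, or 4 percent
print(f"{after_60:.0%} to {after_80:.0%}")
```

Under those assumptions, a system that got one word in five wrong would instead get roughly one in 12 to one in 25 wrong.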

Abe Davis, a graduate student at MIT’s Computer Science and Artificial Intelligence Laboratory whose work has focused on gleaning audio from video by analyzing the tiny vibrations that various objects make, thinks it would be difficult to get VocalZoom to work in a car, where he suspects it could be hampered by things like your head moving around.

In a headset or helmet, however, he could see it being useful.

“It’s just a question of whether you can make sure the laser’s pointed at the right thing,” he says.
