
Visual Audio

Analyzing video of vibrations makes it possible to reconstruct sound.
October 21, 2014

Recovering intelligible speech by photographing the vibrations of a potato-chip bag sounds like the stuff of a spy novel. But researchers at MIT, Microsoft, and Adobe did just that—from 15 feet away through soundproof glass—using an algorithm they developed that reconstructs an audio signal by analyzing minute vibrations of objects depicted in video. They also extracted useful audio signals from videos of aluminum foil, the surface of a glass of water, and the leaves of a potted plant.

“When sound hits an object, it causes the object to vibrate,” says Abe Davis, a graduate student in electrical engineering and computer science and first author on a paper presented at the computer graphics conference Siggraph. “The motion of this vibration creates a very subtle visual signal that’s usually invisible to the naked eye. People didn’t realize that this information was there.”

Reconstructing audio from video requires that the video's sampling rate (the number of frames captured per second) be higher than the frequency of the audio signal; in standard sampling theory, it must be more than twice the highest frequency to be recovered. For the video shot through soundproof glass, the researchers used a high-speed camera that captured 2,000 to 6,000 frames per second. That’s much faster than the 240 frames per second possible with the latest iPhone, but well below the frame rates of the best commercial high-speed cameras, which can top 100,000 frames per second.
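The frame-rate constraint can be made concrete with a little arithmetic. The sketch below is not the researchers' algorithm; it only applies the standard sampling rule that a rate of N samples per second can represent frequencies up to N/2, and that a higher tone "folds" down to a lower apparent frequency. The function names and the 440 Hz test tone are illustrative assumptions.

```python
def max_recoverable_hz(frames_per_second: float) -> float:
    """Highest frequency a given frame rate can represent (the Nyquist limit)."""
    return frames_per_second / 2.0


def apparent_frequency_hz(tone_hz: float, frames_per_second: float) -> float:
    """Frequency a pure tone appears to have after sampling at this frame rate.

    Tones above the Nyquist limit alias: they fold back into the
    range 0 .. frames_per_second / 2.
    """
    folded = tone_hz % frames_per_second
    return min(folded, frames_per_second - folded)


if __name__ == "__main__":
    # Frame rates mentioned in the article; 440 Hz is an assumed test tone.
    for fps in (60, 240, 2000, 6000):
        print(fps, max_recoverable_hz(fps), apparent_frequency_hz(440, fps))
```

Under these assumptions, a 2,000-frame-per-second camera covers tones up to 1,000 Hz, while a plain 60-frame-per-second sequence would register a 440 Hz tone as a 20 Hz one.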

In other experiments, they used an ordinary digital video camera, with a rate of only 60 frames per second. The sensor of a digital camera consists of an array of photodetectors—millions of them, even in inexpensive devices. As it turns out, it’s less expensive to design the sensor so that it reads off the measurements of one row of photodetectors at a time. With fast-moving objects, that can lead to odd visual artifacts: the object may move detectably between the reading of one row and the reading of the next.

Slight distortions of the edges of objects in conventional video, though invisible to the naked eye, thus contain information about the objects’ high-frequency vibrations. And that information is enough to yield a murky but potentially useful audio signal.
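One way to see why row-by-row readout helps is to treat each sensor row as its own sample in time. The sketch below is a simplified timing model, not the method from the paper: the row count of 1,080, the `readout_fraction` parameter (the share of each frame period spent reading rows), and both function names are assumptions made for illustration.

```python
def row_capture_times(frame_index: int, rows: int, frames_per_second: float,
                      readout_fraction: float = 1.0) -> list:
    """Approximate capture time, in seconds, of every row in one frame.

    Assumes rows are read one after another at a constant pace, and that
    readout occupies `readout_fraction` of the frame period.
    """
    frame_period = 1.0 / frames_per_second
    line_delay = readout_fraction * frame_period / rows
    frame_start = frame_index * frame_period
    return [frame_start + row * line_delay for row in range(rows)]


def row_sample_rate_hz(rows: int, frames_per_second: float,
                       readout_fraction: float = 1.0) -> float:
    """Rate at which successive rows are read while a frame is being captured."""
    return rows * frames_per_second / readout_fraction


if __name__ == "__main__":
    # Hypothetical sensor: 1,080 rows at the article's 60 frames per second.
    print(row_sample_rate_hz(1080, 60))
    print(row_capture_times(0, 4, 60))
```

In this model a 60-frame-per-second camera with 1,080 rows reads tens of thousands of rows per second, which suggests how edge distortions between rows could carry vibration information well above the frame rate.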

While this audio reconstruction wasn’t as faithful as the ones made with the high-speed camera, it may still be good enough to identify how many people are speaking in a room, whether they are male or female, and even, given accurate enough information about the acoustic properties of speakers’ voices, who they are.
