Visual Audio

Analyzing video of vibrations makes it possible to reconstruct sound.

Larry Hardesty

October 21, 2014

Recovering intelligible speech by photographing the vibrations of a potato-chip bag sounds like the stuff of a spy novel. But researchers at MIT, Microsoft, and Adobe did just that—from 15 feet away through soundproof glass—using an algorithm they developed that reconstructs an audio signal by analyzing minute vibrations of objects depicted in video. They also extracted useful audio signals from videos of aluminum foil, the surface of a glass of water, and the leaves of a potted plant.

“When sound hits an object, it causes the object to vibrate,” says Abe Davis, a graduate student in electrical engineering and computer science and first author on a paper presented at the computer graphics conference Siggraph. “The motion of this vibration creates a very subtle visual signal that’s usually invisible to the naked eye. People didn’t realize that this information was there.”

Reconstructing audio from video requires that the video’s sampling rate (the number of frames captured per second) be higher than the frequency of the audio signal; strictly, it should be at least twice the highest audio frequency to be recovered. For the video shot through soundproof glass, the researchers used a high-speed camera that captured 2,000 to 6,000 frames per second. That’s much faster than the 240 frames per second possible with the latest iPhone, but well below the frame rates of the best commercial high-speed cameras, which can top 100,000 frames per second.
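To put rough numbers on the frame-rate requirement, here is a minimal Python sketch based on the standard sampling theorem, not on the researchers' own code. The function names are made up for illustration, and the 440 Hz test tone is an arbitrary choice.

```python
def max_recoverable_hz(frames_per_second: float) -> float:
    """Highest audio frequency a given frame rate can capture without aliasing.

    This is the Nyquist limit: half the sampling rate.
    """
    return frames_per_second / 2.0


def aliased_hz(tone_hz: float, frames_per_second: float) -> float:
    """Apparent frequency of a pure tone after sampling at the given frame rate.

    Tones above the Nyquist limit fold back into the range 0..fps/2.
    """
    folded = tone_hz % frames_per_second
    return min(folded, frames_per_second - folded)


if __name__ == "__main__":
    # Frame rates mentioned in the article; the 440 Hz tone is an illustrative assumption.
    for fps in (60, 240, 2000, 6000):
        print(fps, "fps ->", max_recoverable_hz(fps), "Hz limit;",
              "440 Hz tone appears at", aliased_hz(440, fps), "Hz")
```

Under this simplified view, a 2,000-frames-per-second camera covers tones up to 1,000 Hz, while a plain 60-frames-per-second camera treated frame by frame tops out at 30 Hz.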

In other experiments, they used an ordinary digital video camera, with a rate of only 60 frames per second. The sensor of a digital camera consists of an array of photodetectors—millions of them, even in inexpensive devices. As it turns out, it’s less expensive to design the sensor so that it reads off the measurements of one row of photodetectors at a time. With fast-moving objects, that can lead to odd visual artifacts: the object may move detectably between the reading of one row and the reading of the next.

Slight distortions of the edges of objects in conventional video, though invisible to the naked eye, thus contain information about the objects’ high-frequency vibrations. And that information is enough to yield a murky but potentially useful audio signal.
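The row-by-row readout described above can also be put in numbers. The sketch below is only timing arithmetic, not the researchers' algorithm. It assumes a hypothetical sensor with 1,080 rows whose readout is spread evenly across the whole frame period (the `readout_fraction` parameter is an assumption; real sensors differ).

```python
def row_sample_times(fps, rows_per_frame, n_frames, readout_fraction=1.0):
    """Time (in seconds) at which each sensor row is read, frame after frame.

    readout_fraction is the assumed share of the frame period spent
    reading rows; 1.0 means rows are read back to back with no idle gap.
    """
    frame_period = 1.0 / fps
    row_delay = readout_fraction * frame_period / rows_per_frame
    return [
        frame * frame_period + row * row_delay
        for frame in range(n_frames)
        for row in range(rows_per_frame)
    ]


def effective_row_rate(fps, rows_per_frame, readout_fraction=1.0):
    """Rows read per second while the sensor is actively reading out."""
    return fps * rows_per_frame / readout_fraction


if __name__ == "__main__":
    # Hypothetical 60 fps camera with 1,080 rows (illustrative values only).
    print(effective_row_rate(60, 1080), "row readings per second")
    print(row_sample_times(60, 4, 2))
```

Because each row is a separate reading in time, a 60-frames-per-second sensor takes tens of thousands of row readings per second under these assumptions, which is why a slow camera can still carry traces of vibrations far faster than its frame rate.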

While this audio reconstruction wasn’t as faithful as the ones made with the high-speed camera, it may still be good enough to identify how many people are speaking in a room, whether they are male or female, and even, given accurate enough information about the acoustic properties of speakers’ voices, who they are.
