Recovering intelligible speech by photographing the vibrations of a potato-chip bag sounds like the stuff of a spy novel. But researchers at MIT, Microsoft, and Adobe did just that—from 15 feet away through soundproof glass—using an algorithm they developed that reconstructs an audio signal by analyzing minute vibrations of objects depicted in video. They also extracted useful audio signals from videos of aluminum foil, the surface of a glass of water, and the leaves of a potted plant.
“When sound hits an object, it causes the object to vibrate,” says Abe Davis, a graduate student in electrical engineering and computer science and first author on a paper presented at the computer graphics conference Siggraph. “The motion of this vibration creates a very subtle visual signal that’s usually invisible to the naked eye. People didn’t realize that this information was there.”
Reconstructing audio from video requires that the frequency of the video samples—the number of frames captured per second—be higher than the frequency of the audio signal. For the video shot through soundproof glass, the researchers used a high-speed camera that captured 2,000 to 6,000 frames per second. That’s much faster than the 240 frames per second possible with the latest iPhone, but well below the frame rates of the best commercial high-speed cameras, which can top 100,000 frames per second.
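As a rough illustration of that sampling constraint, the sketch below applies the standard Nyquist criterion, which is somewhat stricter than the wording above: sampling once per frame can represent frequencies only up to half the frame rate. The function name is a hypothetical helper chosen for illustration, and the frame rates are simply the ones quoted in this article; none of this is taken from the researchers' paper.

```python
def nyquist_limit_hz(frames_per_second: float) -> float:
    """Highest frequency (Hz) representable when sampling once per frame.

    Assumption: uniform sampling, one sample per video frame.
    """
    if frames_per_second <= 0:
        raise ValueError("frame rate must be positive")
    return frames_per_second / 2.0


# Frame rates quoted in the article: consumer video, the iPhone,
# the researchers' high-speed camera, and top commercial cameras.
for fps in (60, 240, 2000, 6000, 100000):
    print(f"{fps:>7} fps -> frequencies up to {nyquist_limit_hz(fps):,.0f} Hz")
```

By this estimate, a 60-frames-per-second camera sampled once per frame could capture vibrations only up to about 30 Hz, which is below the range of most speech.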
In other experiments, they used an ordinary digital video camera, with a rate of only 60 frames per second. The sensor of a digital camera consists of an array of photodetectors—millions of them, even in inexpensive devices. As it turns out, it’s less expensive to design the sensor so that it reads off the measurements of one row of photodetectors at a time. With fast-moving objects, that can lead to odd visual artifacts: the object may move detectably between the reading of one row and the reading of the next.
Slight distortions of the edges of objects in conventional video, though invisible to the naked eye, thus contain information about the objects’ high-frequency vibrations. And that information is enough to yield a murky but potentially useful audio signal.
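To see why row-by-row readout helps, consider a back-of-the-envelope sketch. It assumes, purely for illustration, that the sensor reads its rows at an even pace across the whole frame period with no idle gap between frames, so each row acts as a separate time sample. The function name and the 1,080-row figure are assumptions, not details from the researchers' work, and real sensors will deviate from this idealized model.

```python
def rolling_shutter_sample_rate_hz(frames_per_second: float, rows_per_frame: int) -> float:
    """Idealized number of row readouts per second.

    Assumption: rows are read one after another, evenly spaced in time,
    filling the entire frame period (no pause between frames).
    """
    if frames_per_second <= 0 or rows_per_frame <= 0:
        raise ValueError("frame rate and row count must be positive")
    return frames_per_second * rows_per_frame


# Hypothetical 60 fps camera with 1,080 sensor rows.
rate = rolling_shutter_sample_rate_hz(60, 1080)
print(f"Idealized row sampling rate: {rate:,.0f} samples per second")
```

Under these assumptions, the rows are read tens of thousands of times per second, far more often than whole frames are captured, which is why edge distortions in ordinary video can carry information about vibrations faster than the frame rate.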
While this audio reconstruction wasn’t as faithful as those made with the high-speed camera, it may still be good enough to identify how many people are speaking in a room, whether they are male or female, and even, given accurate enough information about the acoustic properties of speakers’ voices, who they are.