Recovering intelligible speech by photographing the vibrations of a potato-chip bag sounds like the stuff of a spy novel. But researchers at MIT, Microsoft, and Adobe did just that—from 15 feet away through soundproof glass—using an algorithm they developed that reconstructs an audio signal by analyzing minute vibrations of objects depicted in video. They also extracted useful audio signals from videos of aluminum foil, the surface of a glass of water, and the leaves of a potted plant.
“When sound hits an object, it causes the object to vibrate,” says Abe Davis, a graduate student in electrical engineering and computer science and first author on a paper presented at the computer graphics conference Siggraph. “The motion of this vibration creates a very subtle visual signal that’s usually invisible to the naked eye. People didn’t realize that this information was there.”
Reconstructing audio from video requires that the frequency of the video samples—the number of frames captured per second—be higher than the frequency of the audio signal. For the video shot through soundproof glass, the researchers used a high-speed camera that captured 2,000 to 6,000 frames per second. That’s much faster than the 240 frames per second possible with the latest iPhone, but well below the frame rates of the best commercial high-speed cameras, which can top 100,000 frames per second.
In other experiments, they used an ordinary digital video camera, with a rate of only 60 frames per second. The sensor of a digital camera consists of an array of photodetectors—millions of them, even in inexpensive devices. As it turns out, it’s less expensive to design the sensor so that it reads off the measurements of one row of photodetectors at a time. With fast-moving objects, that can lead to odd visual artifacts: the object may move detectably between the reading of one row and the reading of the next.
Slight distortions of the edges of objects in conventional video, though invisible to the naked eye, thus contain information about the objects’ high-frequency vibrations. And that information is enough to yield a murky but potentially useful audio signal.
While this audio reconstruction wasn’t as faithful as those with the high-speed camera, it may still be good enough to identify how many people are speaking in a room, whether they are male or female, and even, given accurate enough information about the acoustic properties of speakers’ voices, who they are.
Keep Reading
Most Popular
DeepMind’s cofounder: Generative AI is just a phase. What’s next is interactive AI.
“This is a profound moment in the history of technology,” says Mustafa Suleyman.
What to know about this autumn’s covid vaccines
New variants will pose a challenge, but early signs suggest the shots will still boost antibody responses.
Human-plus-AI solutions mitigate security threats
With the right human oversight, emerging technologies like artificial intelligence can help keep business and customer data secure
Next slide, please: A brief history of the corporate presentation
From million-dollar slide shows to Steve Jobs’s introduction of the iPhone, a bit of show business never hurt plain old business.
Stay connected
Get the latest updates from
MIT Technology Review
Discover special offers, top stories, upcoming events, and more.