Recovering intelligible speech by photographing the vibrations of a potato-chip bag sounds like the stuff of a spy novel. But researchers at MIT, Microsoft, and Adobe did just that—from 15 feet away through soundproof glass—using an algorithm they developed that reconstructs an audio signal by analyzing minute vibrations of objects depicted in video. They also extracted useful audio signals from videos of aluminum foil, the surface of a glass of water, and the leaves of a potted plant.
“When sound hits an object, it causes the object to vibrate,” says Abe Davis, a graduate student in electrical engineering and computer science and first author on a paper presented at the computer graphics conference Siggraph. “The motion of this vibration creates a very subtle visual signal that’s usually invisible to the naked eye. People didn’t realize that this information was there.”
Reconstructing audio from video requires that the frequency of the video samples—the number of frames captured per second—be higher than the frequency of the audio signal; by the Nyquist criterion, faithful reconstruction of a tone requires sampling at more than twice its frequency. For the video shot through soundproof glass, the researchers used a high-speed camera that captured 2,000 to 6,000 frames per second. That’s much faster than the 240 frames per second possible with the latest iPhone, but well below the frame rates of the best commercial high-speed cameras, which can top 100,000 frames per second.
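The idea can be sketched in a few lines. In this hedged illustration (not the researchers’ actual pipeline), each video frame contributes one measurement of an object’s sub-pixel displacement; if the frame rate exceeds twice the tone’s frequency, a simple FFT recovers the tone. The frame rate and tone frequency here are illustrative choices.

```python
import numpy as np

fps = 2000          # high-speed camera frame rate (frames per second)
tone_hz = 400       # audio frequency driving the object's vibration
n_frames = fps      # one second of video

# Each frame yields one sample of the object's displacement over time.
t = np.arange(n_frames) / fps
motion = np.sin(2 * np.pi * tone_hz * t)  # simulated sub-pixel motion signal

# Recover the dominant frequency from the per-frame motion signal.
spectrum = np.abs(np.fft.rfft(motion))
freqs = np.fft.rfftfreq(n_frames, d=1 / fps)
recovered = freqs[np.argmax(spectrum)]
print(recovered)  # 400.0
```

Because 2,000 samples per second is well above twice 400 Hz, the tone falls below the Nyquist limit and the FFT peak lands exactly on it.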
In other experiments, they used an ordinary digital video camera, with a rate of only 60 frames per second. The sensor of a digital camera consists of an array of photodetectors—millions of them, even in inexpensive devices. As it turns out, it’s less expensive to design the sensor so that it reads off the measurements of one row of photodetectors at a time—a design known as a rolling shutter. With fast-moving objects, that can lead to odd visual artifacts: the object may move detectably between the reading of one row and the reading of the next.
Slight distortions of the edges of objects in conventional video, though invisible to the naked eye, thus contain information about the objects’ high-frequency vibrations. And that information is enough to yield a murky but potentially useful audio signal.
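A back-of-the-envelope calculation shows why rolling shutter helps. If each row of the sensor is read at a slightly different moment, then each row is effectively its own time sample, so a 60-frames-per-second camera with 1,080 rows can encode far more than 60 samples per second of an edge’s displacement. The sketch below assumes, for simplicity, that the readout spans the entire frame period with no idle time; real sensors typically have some dead time between frames.

```python
fps = 60          # ordinary camera frame rate
rows = 1080       # photodetector rows, read out one at a time

# Assuming the row-by-row readout fills the whole frame period,
# each row lands at a distinct instant in time.
row_read_time = 1 / (fps * rows)   # seconds between consecutive row reads

# Treating each row as an independent sample of a vertical edge's
# horizontal displacement gives an effective sampling rate of:
effective_rate = fps * rows
print(effective_rate)  # 64800 row samples per second
```

That effective rate of tens of thousands of samples per second is why a murky audio signal survives in ordinary 60 fps video.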
While these audio reconstructions weren’t as faithful as those made with the high-speed camera, they may still be good enough to identify how many people are speaking in a room, whether they are male or female, and even, given accurate enough information about the acoustic properties of speakers’ voices, who they are.