Deep learning turns mono recordings into immersive sound

We’ve had 3D images for decades, but effectively imitating 3D sound has always eluded researchers. Now a machine-learning algorithm can produce “2.5D” sound by watching a video.

Emerging Technology from the arXiv

December 26, 2018

Listen to a bird singing in a nearby tree, and you can relatively quickly identify its approximate location without looking. Listen to the roar of a car engine as you cross the road, and you can usually tell immediately whether it is behind you.

The human ability to locate a sound in three-dimensional space is extraordinary. The phenomenon is well understood—it is the result of the asymmetric shape of our ears and the distance between them.

But while researchers have learned how to create 3D images that easily fool our visual systems, nobody has found a satisfactory way to create synthetic 3D sounds that convincingly fool our aural systems.

Today, that looks set to change at least in part, thanks to the work of Ruohan Gao at the University of Texas at Austin and Kristen Grauman at Facebook Research. They have used a trick that humans also exploit to teach an AI system to convert ordinary mono sounds into pretty good 3D sound. The researchers call it 2.5D sound.

First some background. The brain uses a variety of clues to work out where a sound is coming from in 3D space. One important clue is the difference between a sound’s arrival times at each ear—the interaural time difference.

A sound produced on your left will obviously arrive at your left ear before the right. And although you are not conscious of this difference, the brain uses it to determine where the sound has come from.

Another clue is the difference in volume. This same sound will be louder in the left ear than in the right, and the brain uses this information as well to make its reckoning. This is called the interaural level difference.

These differences depend on the distance between the ears. Stereo recordings do not reproduce this effect, because the separation of stereo microphones does not match it.
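To get a feel for the size of the interaural time difference, here is a minimal sketch using Woodworth's spherical-head approximation, a textbook formula that the article itself does not mention. The head radius, the speed of sound and the function name are assumptions chosen for illustration, and the formula is only meant for sources between straight ahead (0 degrees) and directly to one side (90 degrees).

```python
import math

SPEED_OF_SOUND = 343.0   # metres per second in air at room temperature
HEAD_RADIUS = 0.0875     # metres; a typical average value, assumed here


def interaural_time_difference(azimuth_deg, head_radius=HEAD_RADIUS):
    """Woodworth spherical-head estimate of the time difference, in seconds.

    azimuth_deg: 0 is straight ahead, 90 is directly to one side.
    """
    theta = math.radians(azimuth_deg)
    return (head_radius / SPEED_OF_SOUND) * (theta + math.sin(theta))


# Print the estimated delay for a few source directions.
for angle in (0, 30, 60, 90):
    itd_us = interaural_time_difference(angle) * 1e6
    print(f"{angle:>2} degrees: about {itd_us:.0f} microseconds")
```

Under these assumptions the delay grows from zero for a sound straight ahead to roughly 0.66 milliseconds for a sound directly to one side, which gives a sense of how small the differences are that the brain works with.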

The way sound interacts with ear flaps is also important. The flaps distort the sound in ways that depend on the direction it arrives from. For example, a sound from the front reaches the ear canal before hitting the ear flap. By contrast, the same sound coming from behind the head is distorted by the ear flap before it reaches the ear canal.

The brain can sense these differences too. In fact, the asymmetric shape of the ear is the reason we can tell when a sound is coming from above, for example, or from many other directions.

The trick to reproducing 3D sound artificially is to reproduce the effect that all this geometry has on sound. And that’s a tough problem.

One way to measure the distortion is with binaural recording. This is a recording made by placing a microphone inside each ear, which can pick up these tiny variations.

By analyzing the variations, researchers can then reproduce them using a mathematical algorithm known as a head-related transfer function. That turns any ordinary pair of headphones into an extraordinary 3D sound machine.
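In the time domain, applying a head-related transfer function amounts to filtering the mono signal with one impulse response per ear. The sketch below shows that idea in its simplest form. The impulse responses are made-up toy values, not measured data, and the function names are hypothetical.

```python
def convolve(signal, impulse_response):
    """Plain discrete convolution of two lists of samples."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def render_binaural(mono, hrir_left, hrir_right):
    """Filter one mono signal with a separate impulse response per ear."""
    return convolve(mono, hrir_left), convolve(mono, hrir_right)


# Toy impulse responses for a source on the listener's left (invented values):
# the right ear hears the sound two samples later and more quietly.
hrir_left = [1.0, 0.3]
hrir_right = [0.0, 0.0, 0.5, 0.2]

mono = [1.0, 0.0, -1.0]
left, right = render_binaural(mono, hrir_left, hrir_right)
print(left)
print(right)
```

A real system would use impulse responses measured for a particular listener and direction, which is exactly the part that is hard to obtain outside the lab.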

But because everybody’s ears are different, everybody hears sound in a different way. So creating a person’s head-related transfer function means measuring the shape of the person’s ears before playing a recording. And although that can be done in the lab, nobody has worked out how to do it in the wild.

Still, there are ways to approximate 3D sound using the sound distortions that don’t depend on ear shape—the interaural time and level differences.

The trick that Grauman and Gao use is to determine what direction a sound is coming from using visual cues (as humans often do too). So given a video of a scene and a mono sound recording, the machine-learning system works out where the sounds are coming from and then adjusts the interaural time and level differences to produce that effect for the listener.

For example, imagine a video showing a pair of musicians playing a drum and a piano. If the drum is on the left side of the field of view and the piano on the right, it’s straightforward to assume that the drum sounds should come from the left and the piano from the right. That’s what this machine-learning system does, distorting the sound accordingly.
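The drum-and-piano example can be imitated with a simple hand-written rule: delay and attenuate the signal at the ear farther from the source. The sketch below is only an illustration of that principle. The researchers' system learns its mapping from data instead of using a fixed rule like this one, and the function name, the linear mappings and the maximum delay and level values are all assumptions.

```python
def spatialize(mono, position, sample_rate=48000,
               max_itd_s=0.00066, max_ild_db=6.0):
    """Return (left, right) channels for a source at `position`.

    position: -1.0 is the far left of the frame, 0.0 the centre and
    +1.0 the far right. The linear mappings below are assumptions.
    """
    if not -1.0 <= position <= 1.0:
        raise ValueError("position must lie in [-1, 1]")
    # Interaural time difference, expressed as a whole number of samples.
    delay = round(abs(position) * max_itd_s * sample_rate)
    # Interaural level difference, expressed as a gain for the far ear.
    gain = 10 ** (-abs(position) * max_ild_db / 20)
    near = list(mono) + [0.0] * delay
    far = [0.0] * delay + [gain * x for x in mono]
    # A source on the left is nearer to the left ear, and vice versa.
    return (near, far) if position <= 0 else (far, near)


# A drum hit at the far left of the frame, a piano note at the far right.
drum_left, drum_right = spatialize([1.0, 0.5], position=-1.0)
piano_left, piano_right = spatialize([1.0, 0.5], position=1.0)
```

With these settings the drum reaches the right channel 32 samples late and at about half the amplitude, while the piano gets the mirror-image treatment.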

The researchers’ training method is relatively straightforward. The first step in training any machine-learning system is to create a database of examples of the effect it needs to learn. Grauman and Gao created one by making binaural recordings of over 2,000 musical clips that they also videoed.

Their binaural recorder consists of a pair of synthetic ears separated by the width of a human head, which also records the scene ahead using a GoPro camera.

The team then used these recordings to train a machine-learning algorithm to recognize where a sound was coming from given the video of the scene. Having learned this, it is able to watch a video and then distort a monaural recording in a way that simulates where the sound ought to be coming from. “We call the resulting output 2.5D visual sound—the visual stream helps ‘lift’ the flat single channel audio into spatialized sound,” say Grauman and Gao.
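Binaural recordings are convenient training data because each one can supply both the input and the answer. One way to set this up, assumed here for illustration and not spelled out in the article, is to mix the two channels into a mono signal for the input and to use the difference between the channels as the quantity to predict. The function names are hypothetical, and the learning system itself is left out.

```python
def make_training_pair(left, right):
    """Turn one binaural recording into a (mono input, target) pair."""
    mono = [l + r for l, r in zip(left, right)]   # what the model hears
    diff = [l - r for l, r in zip(left, right)]   # what it should predict
    return mono, diff


def reconstruct(mono, predicted_diff):
    """Recover the two channels from the mono mix and a predicted difference."""
    left = [(m + d) / 2 for m, d in zip(mono, predicted_diff)]
    right = [(m - d) / 2 for m, d in zip(mono, predicted_diff)]
    return left, right


# With a perfect prediction, the original channels come back exactly.
left, right = [0.5, 0.25], [0.25, -0.5]
mono, diff = make_training_pair(left, right)
assert reconstruct(mono, diff) == (left, right)
```

In this framing the video's job is to help the model guess the difference signal, which is the information that the mono mix has thrown away.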

The results are impressive. The researchers have made a video of their work; be sure to wear headphones while you’re watching.

The video compares the 2.5D output with the original monaural recordings and shows how convincing the effect can be. “The predicted 2.5D visual sound offers a more immersive audio experience,” say Grauman and Gao.

However, it doesn’t produce full 3D sound, for the reason mentioned above: the researchers don’t create a personalized head-related transfer function.

And there are some situations the algorithm finds difficult to deal with. Obviously, the system cannot deal with any sound source that is not visible in the video. Neither can it deal with sound sources it has not been trained to recognize. This system is focused mainly on music videos.

Nevertheless, Grauman and Gao have a clever idea that works well for many music videos. And they have ambitions to extend its applications. “We plan to explore ways to incorporate object localization and motion, and explicitly model scene sounds,” they say.

Ref: arxiv.org/abs/1812.04204 : 2.5D Visual Sound
