Microsoft’s “3-D Audio” Gives Virtual Objects a Voice

Headphones that make sounds seem to come from specific points in space could be the perfect counterpoint to virtual reality goggles.

Tom Simonite

June 4, 2014

Just as a new generation of virtual reality goggles for video games is about to hit the market, researchers at Microsoft have come up with what could be the perfect accompaniment: a way for ordinary headphones to create a realistic illusion of sound coming from specific locations in space.

In combination with a virtual reality device like the Oculus Rift, the new system could be used to make objects or characters in a virtual world sound as well as look like they are at a specific point in space, even if that is outside a person’s field of view. Microsoft’s researchers refer to the technology as 3-D audio.

In a demonstration of the technology at Microsoft’s Silicon Valley lab, I put on a pair of wireless headphones that made nearby objects suddenly burst into life. A voice appeared to emanate from a cardboard model of a portable radio. Higher-quality music seemed to come from a fake hi-fi speaker. And a stuffed bird high off the ground produced realistic chirps. As I walked around, the sounds shifted to match each object’s position relative to my ears, so the illusion never slipped.

That somewhat eerie experience was made possible because less than a minute earlier I had sat down in front of a Kinect 3-D sensor and been turned briefly to the left and right. Software built a 3-D model of my head and shoulders and then used that model to calculate a personalized filter that made it possible to fool my auditory senses.

**Listen in**: Microsoft researcher David Johnston tests a system that makes sounds seem to originate from specific points in space. Electronics on the top of the headphones have sensors to track the motion of the wearer’s head.

Once such a filter has been recorded, it could be used by many different types of device or software, says Ivan Tashev, the researcher at Microsoft’s Redmond labs working on the project with colleague David Johnston. “You can use this for virtual reality and augmented reality,” he says.

To work properly, Tashev’s system also needs data on the position of the headphones as a person moves his head, provided by motion sensors and a watching camera. However, the kind of motion sensors used by virtual reality headsets like the Oculus Rift could provide enough information (see “10 Breakthrough Technologies 2014”).
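The role of head tracking can be illustrated with a little geometry. The Python sketch below is not Microsoft's code; it assumes a flat, top-down view, a compass-style yaw angle, and a hypothetical function name `relative_azimuth`. It shows why the system needs the head's position and orientation: the direction a sound should appear to come from depends on where the source is relative to the listener's nose, not relative to the room.

```python
import math


def relative_azimuth(head_xy, head_yaw_deg, source_xy):
    """Angle of a sound source relative to where the listener is facing.

    Assumed conventions (illustrative, not from the article):
    - positions are (x, y) in meters, seen from above, +y is "north"
    - head_yaw_deg is a compass heading: 0 = facing north, 90 = facing east
    - the result is in degrees in [-180, 180); positive means the source
      is to the listener's right, negative to the left
    """
    dx = source_xy[0] - head_xy[0]
    dy = source_xy[1] - head_xy[1]
    # Compass bearing from the head to the source (clockwise from north).
    bearing_deg = math.degrees(math.atan2(dx, dy))
    # Subtract the head's own heading, then wrap into [-180, 180).
    return (bearing_deg - head_yaw_deg + 180.0) % 360.0 - 180.0


# A source due east of a north-facing listener sits to the right.
# If the listener turns to face east, the same source is straight ahead.
facing_north = relative_azimuth((0.0, 0.0), 0.0, (1.0, 0.0))
facing_east = relative_azimuth((0.0, 0.0), 90.0, (1.0, 0.0))
```

Recomputing this angle every time the sensors report a new head pose is what would let a virtual sound stay pinned to a fixed object as the wearer turns or walks.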

Tashev’s system is a new twist on an old idea. It has long been known that the unique shape and position of a person’s ears and the anatomy of their head alter how sound reaches the ear canals, a process described as a “head-related transfer function,” or HRTF. A system programmed with a description of these parameters can trick a person into perceiving sound as coming from a particular location.
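In signal-processing terms, applying an HRTF usually means filtering a single-channel sound with a separate impulse response for each ear. The Python sketch below is a deliberately crude illustration of that idea, not the system described here. It replaces a measured HRTF with a toy model: a spherical head of assumed average radius, Woodworth's classic approximation for the arrival-time difference between the ears, and a made-up loudness reduction at the far ear. The names `toy_hrir_pair` and `render_binaural` are hypothetical.

```python
import numpy as np

SAMPLE_RATE = 44_100     # samples per second
SPEED_OF_SOUND = 343.0   # meters per second, in air
HEAD_RADIUS = 0.0875     # meters; an assumed average, not a personalized value


def toy_hrir_pair(azimuth_deg, length=64):
    """Return toy (left, right) impulse responses for a source direction.

    azimuth_deg is in [-90, 90]; positive means the source is to the right.
    The far ear hears the sound later and quieter than the near ear.
    """
    if not -90.0 <= azimuth_deg <= 90.0:
        raise ValueError("this toy model only covers the frontal half-plane")
    theta = abs(np.radians(azimuth_deg))
    # Woodworth's spherical-head approximation of interaural time difference.
    itd_seconds = HEAD_RADIUS / SPEED_OF_SOUND * (theta + np.sin(theta))
    delay = int(round(itd_seconds * SAMPLE_RATE))
    near = np.zeros(length)
    far = np.zeros(length)
    near[0] = 1.0
    # Illustrative level difference: up to 40 percent quieter at the far ear.
    far[delay] = 1.0 - 0.4 * np.sin(theta)
    return (far, near) if azimuth_deg >= 0 else (near, far)


def render_binaural(mono, azimuth_deg):
    """Filter a mono signal into a two-column (left, right) stereo array."""
    left_ir, right_ir = toy_hrir_pair(azimuth_deg)
    left = np.convolve(mono, left_ir)
    right = np.convolve(mono, right_ir)
    return np.stack([left, right], axis=1)


# Example: a single click placed hard to the listener's right.
click = np.zeros(8)
click[0] = 1.0
stereo = render_binaural(click, 90.0)
```

A real HRTF also captures how the outer ear colors different frequencies, which is what lets listeners judge height and front-versus-back; this toy version models only left-right cues.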

Unfortunately, capturing a person’s HRTF is difficult. The most accurate way is to use earplugs fitted with an array of microphones to record exactly what reaches a person’s ears when calibration sounds are played, but that isn’t practical outside a lab. Video game developers create spatial audio effects using average HRTFs, but they can’t offer a very accurate illusion to most people.

When Tashev quickly scans a person’s head, his software generates an approximation of that subject’s HRTF that seems good enough to produce unusually accurate spatial audio. “Essentially we can predict how you will hear from the way you look,” he says. “We work out the physical process of sound going around your head and reaching your ears.” The software that does that was created by capturing accurate HRTFs for 250 people and then comparing them with 3-D scans of their heads.
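The article does not say what kind of model links head scans to HRTFs, so the following Python sketch is only one simple possibility: an ordinary least-squares fit from a few head measurements to a single hearing-related number. Everything in it is an assumption made for illustration, including the function names, the choice of head width and depth as features, the interaural delay as the target, and the synthetic data, which is randomly generated and is not Microsoft's.

```python
import numpy as np


def fit_linear_map(features, targets):
    """Least-squares fit of targets ~ features @ weights + bias.

    features: array of shape (n_people, n_measurements)
    targets:  array of shape (n_people,)
    Returns a vector of the weights followed by the bias term.
    """
    design = np.column_stack([features, np.ones(len(features))])
    coeffs, *_ = np.linalg.lstsq(design, targets, rcond=None)
    return coeffs


def predict(coeffs, features):
    """Apply a fitted map to new head measurements."""
    design = np.column_stack([features, np.ones(len(features))])
    return design @ coeffs


# Synthetic stand-in for a training set of 250 scanned people.
rng = np.random.default_rng(0)
head_width = rng.uniform(0.13, 0.17, size=250)   # meters, made up
head_depth = rng.uniform(0.17, 0.21, size=250)   # meters, made up
scans = np.column_stack([head_width, head_depth])
# Made-up "measured" maximum interaural delay in milliseconds, with noise.
measured_delay_ms = (
    3.0 * head_width + 1.5 * head_depth + 0.05 + rng.normal(0.0, 0.005, size=250)
)

model = fit_linear_map(scans, measured_delay_ms)
new_person = np.array([[0.15, 0.19]])
estimated_delay_ms = predict(model, new_person)
```

A full HRTF is a set of filters covering many directions and frequencies rather than one number, so a practical system would need to predict far more outputs than this toy fit does.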

Tashev says he is now working to improve the capture system and make it smooth and speedy enough to be something a person with a Kinect camera might be able to do at home.

Mark Billinghurst, a professor and leader of the Human Interface Lab at the University of Canterbury, New Zealand, says that the approach developed by Microsoft could have a broad impact if the scanning process can be made practical enough. And being able to deploy 3-D audio tricks on voices, notifications, and sounds in games on smartphone headsets or devices like Google Glass could make them easier to interact with, he says.

“It probably won’t give as accurate results as a very detailed HRTF, but could still be a lot better than can be offered in games and other media today,” Billinghurst says. “That could help with how immersed you feel in a game or virtual environment or even with wearable devices.”
