Computer Speakers for Your Ears Only

Microsoft researchers are developing an algorithm that would allow speakers to work like virtual headphones–even as you walk around your office.

Kate Greene

March 21, 2007

More and more people are using their computers for voice communication, such as Skype and audio instant messaging. For the most part, however, using these features requires a person either to be tethered to the computer by a headset or to speak directly into a microphone and keep the speaker volume low, especially in a shared office space.

**Hear this:** Microsoft researchers have developed an algorithm that adjusts the timing of sound waves emitted from each speaker in an array. As a result, sound waves cancel each other out in some parts of space, and amplify each other in others, effectively creating a focused beam of sound that acts as virtual headphones.

In light of that problem, researchers at Microsoft are trying to make audio output more sophisticated. A team led by Ivan Tashev, a software architect at Microsoft, recently began work on an algorithm that, in theory, will be able to direct sound from a set of speakers–ideally embedded in a computer monitor–into a person’s ears, effectively creating virtual headphones. Just a few inches outside the focal point of the sound waves, the volume fades away dramatically. Crucially, says Tashev, his algorithm could be used with a wide range of inexpensive speakers that could be put into computer monitors.

The goal, he says, is to “target focused sound so that a person can walk around an office and hear” while on a video- or computer-aided audio conference call. Information about a person’s location could be collected by hardware peripherals and fed back into the speaker software, allowing the virtual headphones to move with the user in real time. For example, Tashev says, a camera, either mounted on or embedded in a computer monitor, and image-processing software could determine a person’s position. In addition, an array of four or more microphones on or near a computer monitor could be programmed to localize sound by measuring the subtle differences in when sound arrives at each microphone in the array. In fact, Tashev’s previous work has been to design such sound-localizing algorithms for the types of microphones that are commonly found in the bezel of laptop computers. Employing both a camera and microphones can improve both the accuracy of the tracking and the distance a person can roam while using the speakers.
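The arrival-time idea behind microphone localization can be illustrated with a minimal sketch. It is not Tashev's algorithm: it assumes just two microphones, a distant (far-field) talker, and a nominal speed of sound of 343 m/s, and the function names are hypothetical. It estimates how many samples one microphone signal trails the other, then converts a time difference into an angle of arrival.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed nominal value for room-temperature air


def estimate_delay_samples(sig_a, sig_b, max_lag):
    """Return the lag (in samples) by which sig_b trails sig_a.

    Uses a brute-force cross-correlation over lags in [-max_lag, max_lag].
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, a in enumerate(sig_a):
            j = i + lag
            if 0 <= j < len(sig_b):
                score += a * sig_b[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def arrival_angle_deg(delay_seconds, mic_spacing_m):
    """Far-field angle of arrival, measured from straight ahead (broadside)."""
    ratio = SPEED_OF_SOUND * delay_seconds / mic_spacing_m
    ratio = max(-1.0, min(1.0, ratio))  # guard against rounding past +/-1
    return math.degrees(math.asin(ratio))
```

With more than two microphones, several such pairwise estimates could be combined to narrow down a position rather than just a direction, which is one reason a larger array helps.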

To be sure, the idea of focusing sound isn’t new: military radar systems and common ultrasound equipment, used to image fetuses in utero and find cancerous tumors, have done this for years. The technology is called beamforming, and it is achieved when the sound waves from certain speakers in an array experience microsecond delays, explains Jiashu Chen, technical manager at Finisar Corporation, a data-communications company based in Sunnyvale, CA. The delayed sound waves combine in such a way that in some parts of space, the sound is canceled out, and in others, the sound grows louder.
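As a rough illustration of the delay idea Chen describes, the following sketch computes per-speaker delays so that sound from every speaker in a small array reaches one chosen point at the same instant, then compares the combined amplitude of a pure tone at that point and elsewhere. It is a textbook-style delay-and-sum sketch, not Microsoft's algorithm; the speaker layout, the 343 m/s speed of sound, and the function names are assumptions, and distance attenuation and room reflections are ignored.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed nominal value


def focusing_delays(speaker_positions, target):
    """Delay (seconds) for each speaker so all wavefronts reach target together."""
    dists = [math.dist(p, target) for p in speaker_positions]
    farthest = max(dists)
    # Nearer speakers wait longer; the farthest speaker plays immediately.
    return [(farthest - d) / SPEED_OF_SOUND for d in dists]


def summed_amplitude(speaker_positions, delays, listener, freq_hz):
    """Magnitude of the sum of unit-amplitude tones arriving at listener."""
    total = 0j
    for p, delay in zip(speaker_positions, delays):
        arrival = delay + math.dist(p, listener) / SPEED_OF_SOUND
        total += cmath.exp(-2j * math.pi * freq_hz * arrival)
    return abs(total)


# Four speakers 10 cm apart, focused on a point 1 m in front of the array.
speakers = [(-0.15, 0.0), (-0.05, 0.0), (0.05, 0.0), (0.15, 0.0)]
focus = (0.0, 1.0)
delays = focusing_delays(speakers, focus)
at_focus = summed_amplitude(speakers, delays, focus, 4000.0)
off_focus = summed_amplitude(speakers, delays, (0.5, 1.0), 4000.0)
```

At the focus the four tones arrive in phase and add up fully, while half a meter to the side they arrive out of step and largely cancel. The cancellation depends on frequency, which is consistent with Chen's point that broadband audio is harder to steer than a single-frequency signal.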

However, beamforming systems that direct audible sound, such as music or human voices, are more technically challenging to build than radar and ultrasound are, says Chen, because they must accommodate a wider range of frequencies; lower frequencies require different hardware and software considerations than higher frequencies do. Signal-processing technology has improved to the point that some commercial products use beamforming. Yamaha, for example, sells speakers for home entertainment that bounce focused sound off walls to create virtual speakers behind a listener’s head. But such systems are still rare, and always pricey.

One reason that audio beamforming is expensive is that it is time-consuming to calibrate a given real-world system, says Microsoft’s Tashev. Each speaker has slight variations in the sound it emits, and since focusing a sound beam requires extreme precision in timing, these slight variations can cause large distortions in sound. Therefore, the software used to focus the sound is calibrated to work with specific hardware, and when it’s purchased, the whole system needs to be calibrated to the shape of the room in which it’s installed.

Microsoft wants to develop software that’s good enough to work with any speakers, with a minimal amount of calibration required at the factory or by users. To allow generic speakers to focus sound, Tashev and his group have modified well-known beamforming algorithms. They designed a part of a signal-processing algorithm, called a filter, to accommodate a wide range of manufacturing tolerances, or the data that describe speaker performance at various frequencies. “You have to know how those parameters vary,” Tashev says. “When you design the algorithm, you do it for multiple instances of speaker arrays.”

The trick, he says, is to try to find a happy medium among the different tolerances so that the resultant sound is comparable across speakers. This requires some fine-tuning, and the researchers are still determining the best way to implement the speaker tolerances. However, Tashev concedes, by making a generic beamforming algorithm, there will most likely be a trade-off in performance. “You have to make some compromises,” he says.
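One way to picture designing “for multiple instances of speaker arrays” is a small simulation, sketched here under loud assumptions: it does not reproduce Tashev's filter, and the tolerance values, uniform error model, and function names are hypothetical. It draws many random arrays whose speakers deviate slightly in timing and gain, and reports the weakest focused amplitude seen across them.

```python
import cmath
import math
import random


def focus_amplitude(timing_errors_s, gains, freq_hz):
    """Combined tone amplitude at the focal point for one imperfect array."""
    total = 0j
    for err, gain in zip(timing_errors_s, gains):
        total += gain * cmath.exp(-2j * math.pi * freq_hz * err)
    return abs(total)


def worst_case_focus(num_speakers, timing_tol_s, gain_tol, freq_hz,
                     trials=1000, seed=0):
    """Lowest focal amplitude over randomly drawn arrays within tolerance."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    worst = float("inf")
    for _ in range(trials):
        errs = [rng.uniform(-timing_tol_s, timing_tol_s)
                for _ in range(num_speakers)]
        gains = [1.0 + rng.uniform(-gain_tol, gain_tol)
                 for _ in range(num_speakers)]
        worst = min(worst, focus_amplitude(errs, gains, freq_hz))
    return worst


ideal = worst_case_focus(4, 0.0, 0.0, 4000.0)
realistic = worst_case_focus(4, 50e-6, 0.1, 4000.0)
```

A designer could compare candidate settings by their worst-case or average score across such simulated arrays rather than by their score on one ideal array, which is where the compromise in peak performance comes from.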

Tashev points out that the project is still in its early stages. “Even if you have a good beamformer, it’s not enough,” he says. “You also have to have a sound localizer [such as a camera or specialized microphone array] that tells you where to point the beam.” Moreover, he says, in order for the beamforming algorithm to be successful, it would need to take into account sound reflections from walls and windows within an office.

“It’d be neat to see this out there,” says Stan Birchfield, professor of electrical and computer engineering at Clemson University, in Clemson, SC. Birchfield works on image-processing techniques that use cameras to identify a person’s location to improve the focus of microphone arrays. “Tracking is a really hard problem,” he says, one that no one has found a way to solve for an environment like an office. It’s encouraging that Microsoft is exploring the area, Birchfield adds, but until the company has plans for products, he’s “cautious of getting enthusiastic.”

Tashev says that commercialization of this technology will require a complex coordination of many factors that could take up to three years to achieve, even after a research prototype has been perfected. Perfecting the prototype will itself take time: Tashev says the group still needs to test the reliability of the algorithm with a number of speaker arrays. Then, in order to turn the work into a product, Microsoft will need to find the best way to integrate the algorithm into Windows Media Player, make sure drivers for the hardware are included in the operating system, and, Tashev says, find companies that are interested in manufacturing speakers for such an application. But if and when all this happens, the payoff will be great, he says. People will no longer need headsets to have a private Skype conversation or video conference.
