The Era of Ubiquitous Listening Dawns

Moto X strengthens the hand of apps and technologies that emphasize listening to everything, all the time.

David Talbot

August 8, 2013

The Moto X, the new smartphone from Google’s Motorola Mobility, might be remembered best someday for helping to usher in the era of ubiquitous listening.

Unlike earlier phones, the Moto X includes two low-power chips whose only function is to process data from a microphone and other sensors—without tapping the main processor and draining the battery. This is a big endorsement of the idea that phones could serve you better if they did more to figure out what is going on (see “Motorola Reveals First Google-Era Phone”). For instance, you might say “OK Google Now” to activate Google’s intelligent assistant software, rather than having to first tap the screen or press buttons to get an audio-processing function up and running.

This brings us closer to having phones that continually monitor their auditory environment to detect the phone owner’s voice, discern what room or other setting the phone is in, or pick up other clues from background noise. Such capacities make it possible for software to detect your moods, know when you are talking and should not be disturbed, and perhaps someday keep a running record of everything you hear.

“Devices of the future will be increasingly aware of the user’s current context, goals, and needs, will become proactive—taking initiative to present relevant information,” says Pattie Maes, a professor at MIT’s Media Lab. “Their use will become more integrated in our daily behaviors, becoming almost an extension of ourselves. The Moto X is definitely a step in that direction.”

Even before the Moto X, there were apps, such as the Shazam music-identification service, that could continually listen for a signal. When users enable a new feature called “auto-tagging” on a recent update to Shazam’s iPad app, Shazam listens to everything in the background, all the time. It’s seeking matches for songs and TV content that the company has stored on its servers, so you can go back and find information about something that you might have heard a few minutes ago. But the key change is that Shazam can now listen all the time, not just when you tap a button to ask it to identify something. The update is planned for other platforms, too.

But other potential uses abound. Tanzeem Choudhury, a researcher at Cornell University, has demonstrated software that can detect whether you are talking faster than normal, as well as other changes in pitch or frequency that suggest stress. The StressSense app she is developing aims to do things like pinpoint the sources of your stress: is it the 9:30 a.m. meeting, or a call from Uncle Hank?

Similarly, audio analysis could allow the phone to understand where it is, and so make fewer mistakes, says Vlad Sejnoha, the chief technology officer of Nuance Communications, which develops voice-recognition technologies. “I’m sure you’ve been in a situation where someone has a smartphone in their pocket and suddenly a little voice emerges from the pocket, asking how they can be helped,” he says. That happens when an assistance app like Apple’s Siri is accidentally triggered. If the phone’s always-on ears could accurately detect the muffled acoustical properties of a pocket or purse, it could eliminate this false start and stop phones from accidentally dialing numbers as well. “That’s a work in progress,” Sejnoha says. “And while it’s amusing, I think the general principle is serious: these devices have to try to understand the users’ world as much as possible.”

A phone might use ambient noise levels to decide how loud a ringtone should be: louder if you are out on the street, quieter if you are inside, says Chris Schmandt, director of the speech and mobility group at MIT’s Media Lab. Taking that concept a step further, a phone could detect an ambient conversation and recognize that one of the speakers was its owner. Then it might mute a potentially disruptive ringtone unless the call was from an important person, such as a spouse, Schmandt adds.

Schmandt says one of his grad students once recorded two years’ worth of all the sounds he was exposed to—capturing every conversation. While the speech-to-text conversions were rough, they were good enough that he could perform a keyword search and recover the actual recording of a months-old conversation.

How far could this go? Much will depend on the willingness of phone owners to let their apps transmit audio of their environments over the wireless network. People skittish about surveillance might have second thoughts.
