
“Alexa, Understand Me”

Voice-based AI devices aren’t just jukeboxes with attitude. They could become the primary way we interact with our machines.
August 9, 2017
Illustration by Roman Muradov

On August 31, 2012, four Amazon engineers filed the fundamental patent for what ultimately became Alexa, an artificial intelligence system designed to engage with one of the world’s biggest and most tangled data sets: human speech. The engineers needed just 11 words and a simple diagram to describe how it would work. A male user in a quiet room says: “Please play ‘Let It Be,’ by the Beatles.” A small tabletop machine replies: “No problem, John,” and begins playing the requested song.

From that modest start, voice-based AI for the home has become a big business for Amazon and, increasingly, a strategic battleground with its technology rivals. Google, Apple, Samsung, and Microsoft are each putting thousands of researchers and business specialists to work trying to create irresistible versions of easy-to-use devices that we can talk with. “Until now, all of us have bent to accommodate tech, in terms of typing, tapping, or swiping. Now the new user interfaces are bending to us,” observes Ahmed Bouzid, the chief executive officer of Witlingo, which builds voice-driven apps of all sorts for banks, universities, law firms, and others.

For Amazon, what started out as a platform for a better jukebox has become something bigger: an artificial intelligence system built upon, and constantly learning from, human data. Its Alexa-powered Echo cylinder and tinier Dot are omnipresent household helpers that can turn off the lights, tell jokes, or let you hear the news hands-free. They also collect reams of data about their users, which Amazon uses to improve Alexa and expand what it can do.

Tens of millions of Alexa-powered machines have been sold since their market debut in 2014. In the U.S. market for voice-powered AI devices, Amazon is believed to ring up about 70 percent of all unit sales, though competition is heating up. Google Home has sold millions of units as well, and Apple and Microsoft are launching their own versions soon.

The ultimate payoff is the opportunity to control—or at least influence—three important markets: home automation, home entertainment, and shopping. It’s hard to know how many people want to talk to their refrigerators, but patterns of everyday life are changing fast. In the same way that smartphones have changed everything from dating etiquette to pedestrians’ walking speed, voice-based AI is beginning to upend many aspects of home life. Why get up to lock the front door or start your car heater on a bitterly cold day, when Alexa or her kin can instantly sort things out instead?

For now, Amazon isn’t trying to collect revenue from companies making smart thermostats, lightbulbs, and other Alexa-connected devices. Down the road, though, it’s easy to imagine ways that revenue-sharing arrangements or other payments could catch on. The smallest of these three markets, home automation, already accounts for more than $5 billion of spending each year, while retail sales in the U.S. last year totaled $4.9 trillion. Today Amazon makes money on the machines themselves, at prices ranging from $50 for Dots to $230 for the highest-end Echos with video screens, and reaps a second payoff if users end up shopping more heavily at Amazon’s vast online store. (Amazon won’t disclose those traffic numbers, however.)

For Echos to become as pervasive as smartphones, they will need to do many more things. To that end, Amazon is encouraging independent developers to build new services on the platform, just as Apple has long done with app developers. More than 15,000 such “skills,” or apps, have been built so far, and app-building tools have become so easy to snap together that it’s now possible to build a simple skill in about an hour, without much programming knowledge. Among the most popular apps are ride-hailing options from Uber and Lyft. Duds include 48 separate skills that bombard listeners with insults.
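
For a sense of how lightweight these skills can be, here is a minimal sketch of a custom skill's backend, written as an AWS Lambda-style Python handler. The intent name GetFactIntent and the spoken replies are hypothetical; only the basic Alexa Skills Kit request and response envelope is assumed.

```python
# Minimal sketch of an Alexa custom skill handler, written as an AWS
# Lambda-style function. The "GetFactIntent" name and the reply text are
# hypothetical; the request/response shapes follow the Alexa Skills Kit
# JSON interface.

def build_response(speech_text, end_session=True):
    """Wrap plain text in the Alexa Skills Kit response envelope."""
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": speech_text},
            "shouldEndSession": end_session,
        },
    }

def lambda_handler(event, context):
    """Route an incoming Alexa request to a spoken reply."""
    request = event["request"]

    if request["type"] == "LaunchRequest":
        # User said "Alexa, open <skill name>" with no specific question.
        return build_response("Welcome. Ask me for a fact.", end_session=False)

    if request["type"] == "IntentRequest":
        if request["intent"]["name"] == "GetFactIntent":
            return build_response("A bolt of lightning is hotter than the surface of the sun.")

    # Fall back gracefully for anything the skill does not understand.
    return build_response("Sorry, I didn't catch that.")
```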

Among the most ambitious developers are companies making hardware or selling services that work with Alexa. Capital One, for example, is offering Alexa-based bill payment to its banking customers; Toronto-based Ecobee is one of a number of makers of smart thermostats to rig up Alexa-powered versions that let people raise or lower room temperatures merely by uttering a few words. “Our customers have busy lives,” says Stuart Lombard, chief executive of Ecobee, which now gets roughly 40 percent of overall sales from its Alexa devices, the 10-year-old company’s fastest-growing product line. “They have to fight traffic to get home, and then they have to feed the kids, diaper the baby, and who knows what else. We give them a hands-free way of getting something done while they’re in the midst of other tasks.”

When speech meets AI

What makes voice-based AI so appealing to consumers is its promise to conform to us, to respond to the way we speak—and think—without requiring us to type on a keyboard or screen. That’s also what makes it so technically difficult to build. We aren’t at all orderly when we talk. Instead, we interrupt ourselves. We let thoughts dangle. We use words, nods, and grunts in odd ways, and we assume that we’re making sense even when we aren’t.

Thousands of Amazon staffers are working on this challenge, including some at research hubs in Seattle, Sunnyvale, California, and Cambridge, Massachusetts. Even so, Amazon’s careers page recently offered 1,100 more Alexa jobs spread across a dozen departments, including 215 slots for machine-learning specialists. During a meeting at the company’s Cambridge offices, I asked Alexa’s head scientist, Rohit Prasad, why he needs so many people—and when his research team might be fully built out.

“I’m laughing at every aspect of your question,” Prasad replied.

After a few seconds, having regained his composure, Prasad explained that he’s been working on speech technology for 20 years, with frustratingly slow results for most of that period. In the past five years, however, giant opportunities have opened up. Creating a really effective voice-triggered AI is a complex and still unconquered task (see “AI’s Language Problem”). But while in the past, speech scientists struggled to determine the exact meaning of sometimes-chaotic utterances on the first try, new approaches to machine learning are making progress by taking a different tack: they work from imperfect matches at the outset, followed by rapid fine-tuning of provisional guesses. The key is working through large swaths of user data and learning from earlier mistakes. The more time Alexa spends with its users, the more data it collects to learn from, and the smarter it gets. With progress comes more opportunity, and the need for more manpower.

“Let me give you an example,” Prasad said. “If you ask Alexa ‘What was Adele’s first album?’ the answer should be ‘19.’ If you then say, ‘Play it,’ Alexa will know enough to start playing that album.” But what if there’s some conversational banter in the middle? What if you first ask Alexa what year the album came out, and how many copies it sold? Finish such an exchange with the cryptic “Play it,” and earlier versions of Alexa would have been stumped. Now the technology can follow that train of thought, at least sometimes, and recognize that “it” still means “19.”
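
One way to picture the context carryover Prasad describes is as a short-lived memory of the entities mentioned earlier in an exchange, consulted whenever a follow-up such as "Play it" names nothing explicitly. The sketch below is an illustrative toy, not Amazon's implementation; the DialogueContext class and the sample dialogue are invented.

```python
# Illustrative sketch of conversational context carryover: when the user
# says "play it", fall back to the most recently mentioned album.
# This is a toy model, not Amazon's implementation.

class DialogueContext:
    def __init__(self):
        self.last_entity = None  # e.g. ("album", "19")

    def remember(self, entity_type, value):
        self.last_entity = (entity_type, value)

    def resolve(self, utterance):
        """Resolve a pronoun like 'it' against the stored entity."""
        if "it" in utterance.lower().split() and self.last_entity:
            entity_type, value = self.last_entity
            return f"play the {entity_type} '{value}'"
        return utterance

ctx = DialogueContext()

# "What was Adele's first album?"  ->  answer "19", which the context remembers.
ctx.remember("album", "19")

# Banter in between (release year, sales figures) keeps the same entity alive.
# Then the cryptic follow-up:
print(ctx.resolve("Play it"))   # -> play the album '19'
```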

This improvement comes from machine-learning techniques that reexamined thousands of previous exchanges in which Alexa stumbled. The system learns what song users actually did want to hear, and where the earlier parts of the conversation first identified that piece of music. “You need to make some assumptions at the beginning about how people will ask for things,” says James Glass, head of the spoken-language systems group at MIT. “Then you gather data and tune up your models.”

The case for such a machine-learning approach is widely appreciated, Glass says, but making it work requires far more data than university researchers can easily muster. With Alexa’s usage surging, Amazon now has access to an expansive repository of human-computer speech interactions—giving it the sort of edge in fine-tuning its voice technology that Google has long enjoyed in text-based search queries. Outside data helps, too: a massive database of song lyrics loaded into Alexa in 2016, for example, has helped assure that users asking for the song with “drove my Chevy to the levee” will be steered to Don McLean’s “American Pie.”
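
Mapping a half-remembered lyric to the right song can be framed as retrieval over a lyric index. The sketch below scores songs by simple word overlap against a tiny, hypothetical two-song index; a production system would presumably rely on a far larger catalog and more robust matching.

```python
# Toy lyric-to-song lookup: score each song by how many of the query's
# words appear in its lyrics, and return the best match. The two-song
# "index" is hypothetical and drastically smaller than a real catalog.

LYRIC_INDEX = {
    "American Pie - Don McLean": "drove my chevy to the levee but the levee was dry",
    "Let It Be - The Beatles": "when i find myself in times of trouble "
                               "mother mary comes to me speaking words of wisdom let it be",
}

def find_song(query):
    """Return the song whose lyrics share the most words with the query."""
    query_words = set(query.lower().split())
    def overlap(lyrics):
        return len(query_words & set(lyrics.split()))
    return max(LYRIC_INDEX, key=lambda song: overlap(LYRIC_INDEX[song]))

print(find_song("the song with drove my Chevy to the levee"))
# -> American Pie - Don McLean
```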

One of the newest projects for Prasad’s group highlights the flexibility of this approach. It involves deciphering the moments when users backtrack on their initial requests. Signaling phrases can vary enormously. Some people say “No, no, no”; others prefer “Cancel that,” and a third bunch tries some variant of “Wait, actually, here’s what I want instead.” Alexa doesn’t need to decode each utterance. Large samples and semi-supervised machine learning enable it to outline a cluster of likely markers for negated speech, and then pick up a coherent new request after the change of course.
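
A rough way to picture that clustering step: start from a few labeled "change of mind" phrases, then absorb unlabeled utterances that look similar enough to the cluster. The seed phrases, similarity measure, and threshold below are invented for illustration and stand in for far more sophisticated semi-supervised methods.

```python
# Toy illustration of grouping "change of mind" phrases: begin with a few
# labeled seed phrases, then absorb unlabeled utterances that resemble them
# (a crude stand-in for semi-supervised clustering). Phrases and threshold
# are invented for the example.

SEED_CANCEL_PHRASES = {"no no no", "cancel that", "wait actually i want something else"}

def word_set(text):
    return set(text.lower().replace(",", "").split())

def similarity(a, b):
    """Jaccard similarity between the word sets of two phrases."""
    wa, wb = word_set(a), word_set(b)
    return len(wa & wb) / len(wa | wb)

def looks_like_cancellation(utterance, cluster, threshold=0.3):
    return any(similarity(utterance, phrase) >= threshold for phrase in cluster)

cluster = set(SEED_CANCEL_PHRASES)
unlabeled = [
    "no wait cancel that please",
    "actually play something else instead",
    "turn up the volume",
]

for utterance in unlabeled:
    if looks_like_cancellation(utterance, cluster):
        cluster.add(utterance)   # grow the cluster with a confident match
        print("cancellation:", utterance)
    else:
        print("not a cancellation:", utterance)
```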

In addition to making Alexa a better listener, Amazon’s AI experts are using troves of data to make it a better speaker, fine-tuning the cadences of the machine’s synthetic-female voice, in order to boost sustained usage. Traditional attempts at speech synthesis rely on fusing many snippets of recorded human speech. While this technique can produce a reasonably natural sound, it doesn’t lend itself to whispers, irony, or other modulations an engaging human speaker might use. To sharpen Alexa’s handling of everything from feisty dialogue to calm recital, Amazon’s machine-learning algorithms can take a different approach, training on the eager, anxious—and wise-sounding—voices of professional narrators. It helps that Amazon owns the audiobook publisher Audible.

So much to talk about

Among the most ardent adopters of voice-based AI are people who can’t easily type on phones or tablets. Gavin Kerr, chief executive of Philadelphia’s Inglis, which provides housing and services for people with disabilities, has installed Amazon Echo and Dot devices in eight residents’ homes. He hopes to eventually add them to all 300-some residences once pilot testing is complete. “It’s an incredible boon for residents,” Kerr says. “They can be more comfortable. It gives them independence.”

Kerr works with hundreds of people who have multiple sclerosis or other debilitating conditions. For those who are bedridden or use wheelchairs, a hard-to-reach wall thermostat can be a constant source of torment. “Their bodies have a hard time regulating temperature,” Kerr explains. “A room that’s 72 °F may feel hot one hour and cold the next.” With limited mobility, there’s no easy way to get comfortable, especially if round-the-clock assistance isn’t available.

With a bit of tinkering, Alexa’s software can serve even those with severely restricted speech. Kerr tells of one man in his late 30s who wanted to leave a long-term-care facility and move back into an everyday community. “He told us, ‘I’ll never be able to use Alexa’s commands,’” Kerr recalls. “So we asked him, ‘What can you say?’ Then we reworked the software so he could make Alexa work on his terms. Now he says ‘Mom’ when he wants to turn on the kitchen lights, and ‘John’ when he wants to turn on the bathroom lights.”
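
Conceptually, that rework amounts to a small lookup layer that maps whatever words a user can reliably say onto standard smart-home actions. The sketch below is purely illustrative; the phrases, device names, and send_command stub are hypothetical, not the software Inglis actually deployed.

```python
# Hypothetical remapping layer: the short words a resident can reliably
# produce are translated into the smart-home action chosen for them.
# Phrases, device names, and send_command are all illustrative.

CUSTOM_PHRASES = {
    "mom": ("kitchen_lights", "on"),
    "john": ("bathroom_lights", "on"),
}

def send_command(device, state):
    # Stand-in for a real smart-home API call.
    print(f"setting {device} -> {state}")

def handle_utterance(utterance):
    key = utterance.strip().lower()
    if key in CUSTOM_PHRASES:
        device, state = CUSTOM_PHRASES[key]
        send_command(device, state)
    else:
        print("unrecognized phrase:", utterance)

handle_utterance("Mom")   # -> setting kitchen_lights -> on
handle_utterance("John")  # -> setting bathroom_lights -> on
```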

Although Inglis provides its Echo users with four hours of training, it’s far more common for new users to grope their way along. Pull an Echo out of the box, and a bit of packaging will highlight especially common applications, such as playing music, setting alarms, or updating shopping lists. Organized users can call up Alexa control panels on their smartphones or laptops to adjust settings, hunt for new apps, or get guidance on what prompts will make an app work best.

In a widely read blog post in June, Microsoft product manager Darren Austin wrote that Alexa’s broader success resides in its ability to alleviate the stresses of an overbooked life. “With the simple action of asking,” Austin wrote, “Alexa relieves the negative emotions of uncertainty and the fear of forgetting.” Users get hooked on bringing all sorts of momentary puzzlements or desires to Alexa, he contended; it’s the companion that’s always ready to engage.

Every week—sometimes more often—Alexa general manager Rob Pulciani scans aggregate data on the most common utterances by Alexa and Dot users. Typically, the top of the list is dominated by requests for music, news, weather, traffic, and games. This past spring, however, a newcomer was rising fast. The trending phrase: “Alexa, help me relax.”

When users make this request, they are steered into a collection of soothing sounds. Birds chirp; distant waves hit the shore; freight trains rumble through the night. Such ambient noise loops can keep playing for hours if users choose. Pulciani had regarded these apps as minor oddities when they first appeared on the Alexa platform, in 2015. But they have rapidly picked up a big following. Stressed-out adults use the sounds to fall asleep. Parents turn them into lullaby substitutes for cranky infants. In the weeks after his discovery, Pulciani and colleagues fine-tuned Alexa’s internal architecture so new Echo buyers could rapidly discover soothing sounds if they asked for pointers about what new skills to try.

Sustained conversation

In studies, the AI platforms from Google, Apple, Microsoft, and Amazon show different strengths. Google Assistant is strongest at wide-ranging search commands; Apple’s Siri and Microsoft’s Cortana have their own specialties; Alexa does particularly well with shopping commands.

The ultimate triumph for voice-based AI would be to carry on a realistic, multi-minute conversation with users. Such a feat will require huge jumps in machines’ ability to discern human speakers’ intent, even when there isn’t an obvious request. Humans can figure out that a friend who says “I haven’t been to the gym in weeks” probably wants to talk about stress or self-esteem. For AI software, that’s a hard leap. Sudden switches in topic—or oblique allusions—are tough, too.

Eager to strengthen ties with the next generation of AI and speech researchers, a year ago Amazon invited engineering students at a dozen universities worldwide to build voice bots that can sustain a 20-minute conversation. The campus making the most progress by this November’s deadline will win a $500,000 prize. I auditioned a half-dozen of these bots one weekend, moving each time from simple queries to trickier open-ended statements of opinion that invited all kinds of possible replies. We got off to a good start when one bot asked me, “Did you see any recent movies?” “Yes,” I replied, “we saw Hidden Figures.” Rather than mimic newspaper reviews of this poignant film about NASA’s early years, the social bot shot back: “I thought Hidden Figures was very thin on the actual mathematics of it all.” Not my take on the film, but it seemed like a charmingly appropriate thing for an AI program to say. Our conversation stalled out soon afterward, but at least we had that brief, beautiful moment.

Alas, none of the other bots could come close. The most confused one blurted out sentences such as “Do you like curb service?” when I thought we were trying to talk about Internet sites. I said something perhaps a little sharp about the bot’s limitations, only to be asked: “Can you collective bargaining?”

A few days later, when I asked Amazon’s Prasad for his take on the social bots, none of their early failings bothered him. “It’s a super-important area,” he told me. “It’s where Alexa could go in terms of being very smart. But this is way harder than playing games like Go or chess. With those games, even though they have a lot of possible moves, you know what the end goal is. With a conversation, you don’t even know what the other person is trying to accomplish.” When Alexa is able to figure that out, we will really be talking.

George Anders has covered Amazon for national publications since the late 1990s. His newest book is You Can Do Anything.
