Talking to Your Phone

A startup makes a new entry in the race to build the virtual personal assistant.

Erica Naone

July 16, 2010

Smart phones promise a lot of computing power and connectivity: We can search the Web and communicate from anywhere. But it can be hard to make full use of all these capabilities on small screens with tiny buttons. Now comes a new wave of applications that combine speech recognition and artificial intelligence to help people carry out simple tasks on their mobile devices.

**Dialing smarter:** When the user speaks a phrase such as “Chinese food,” Vlingo’s new SuperDialer application, above, draws together results from targeted advertising, the user’s personal contacts, and Web search results filtered by location.

The latest such service, from Vlingo, a company that makes voice-recognition applications, tries to go beyond earlier apps by combining a user’s spoken commands with personal data and information stored online. Called “SuperDialer,” the service can, for example, let a user say “Call pizza” and then see a list of nearby pizza places drawn from both the user’s address book and the Web.

The SuperDialer is the first of a series of releases planned by Vlingo. All are intended to add a stronger artificial intelligence backbone to the company’s speech recognition software.

In August, Vlingo hopes to release a social networking application that would connect with a variety of the user’s accounts, including those on sites such as the location-based services Foursquare and Loopt. Users could, for instance, ask aloud where their friends are and retrieve answers.

A separate service in the works, Vlingo Answers, would respond if a user asked a specific question such as “How old is Kiefer Sutherland?” Vlingo would try to get the answers from standard Web search results and scans of specialty information sites such as Wolfram Alpha and True Knowledge.

On the surface, applications like these may seem simple, but CEO Dave Grannan says they involve sophisticated levels of technology. First, the application has to recognize what the user is saying. Then it has to work out what the user means, for example by deciding how to interpret words that could have multiple meanings, such as “vets.” Finally, it has to get the information the user needs and provide an easy interface for acting on it.
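The three steps Grannan describes can be pictured with a toy sketch. This is only an illustration under stated assumptions, not Vlingo's software: the contact list, the Web listings, the sense table for “vets,” and every function name below are hypothetical, and real speech recognition is replaced by a transcript that is simply handed in.

```python
from dataclasses import dataclass

# Hypothetical sample data; none of it comes from Vlingo or the article.
CONTACTS = {"Tony's Pizza": "555-0101", "Dr. Lee (vet)": "555-0102"}
WEB_LISTINGS = [
    {"name": "Pizza Palace", "phone": "555-0201", "category": "pizza", "miles": 0.4},
    {"name": "Slice House", "phone": "555-0202", "category": "pizza", "miles": 6.0},
    {"name": "City Animal Clinic", "phone": "555-0203", "category": "veterinarian", "miles": 1.2},
]
# Toy table of possible meanings for ambiguous words (assumption).
SENSES = {"vets": ["veterinarian", "veteran"]}


@dataclass
class Result:
    name: str
    phone: str
    source: str  # "contacts" or "web"


def recognize(transcript: str) -> str:
    """Step 1 stand-in: a real recognizer would turn audio into this text."""
    return transcript.strip().lower()


def interpret(text: str) -> dict:
    """Step 2: split the command into an action and a topic, resolving ambiguity."""
    action, _, topic = text.partition(" ")
    if action == "call" and topic in SENSES:
        # Assumed rule: for a "call" command, prefer the sense that names a business category.
        categories = {listing["category"] for listing in WEB_LISTINGS}
        topic = next((s for s in SENSES[topic] if s in categories), topic)
    return {"action": action, "topic": topic}


def lookup(intent: dict, max_miles: float = 2.0) -> list:
    """Step 3: gather matches from contacts first, then nearby Web listings."""
    topic = intent["topic"]
    if intent["action"] != "call" or not topic:
        return []
    results = [Result(name, phone, "contacts")
               for name, phone in CONTACTS.items() if topic in name.lower()]
    results += [Result(l["name"], l["phone"], "web")
                for l in WEB_LISTINGS
                if l["category"] == topic and l["miles"] <= max_miles]
    return results


def handle(transcript: str) -> list:
    """Run the three steps in order for one spoken command."""
    return lookup(interpret(recognize(transcript)))
```

In this sketch, `handle("Call pizza")` returns the matching contact followed by the one pizza listing within two miles, and `handle("Call vets")` resolves “vets” to the veterinarian sense before searching. A production system would need far richer language understanding than a lookup table.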

Grannan says Vlingo’s goal is to help users transform words into actions, so that people don’t have to think about what button to push or exactly how to say what they need a device to do.

This idea is similar to the virtual assistant for the iPhone offered by Siri, a company that was recently acquired by Apple for an undisclosed amount. Siri’s CEO, Dag Kittlaus, often referred to his company’s technology as a “doing engine,” and carefully distinguished its ability to accomplish tasks for the user from the Web’s familiar search functions.

Grannan concedes that Siri’s deep artificial intelligence technology, spun out of research at SRI International in Menlo Park, CA, surpasses the artificial intelligence that Vlingo now uses. However, he still sees a big opportunity for Vlingo to make its mark. Instead of the “inch-wide, mile-deep” approach that he believes characterizes Siri, Grannan hopes that Vlingo can offer artificial intelligence that’s “mile wide and inch deep.” In other words, he says, Siri is adept at a very narrow set of subjects, such as helping people make restaurant reservations, but he wants Vlingo to handle a broader range of topics.

The basic version of Vlingo is free; the Cambridge, MA-based company gets revenue by selling targeted advertising and by charging users for the ability to carry out more sophisticated functions, such as voice recognition for sending text messages. Its application is available for Android, iPhone, BlackBerry, Nokia, and Windows Mobile.

The vision of the personal intelligence agent that apps such as Vlingo and Siri represent has been a major research goal for decades.

Voice recognition and natural language processing have made huge strides in the past decade, enabling computers to better understand what people are saying. But one of the main problems in bringing the technology to smart phones has been that users need to see a device react to voice input within a few seconds in order to feel that an application is working, says Mazin Gilbert, executive director of technical research at AT&T Labs and an expert in these technologies. Smart phones don’t have the processing power needed for sophisticated voice recognition and analysis; any device using such an app is just absorbing audio and sending it over the network. Until very recently, Gilbert says, slow network speeds caused a bottleneck that made apps like SuperDialer impractical.

Today’s voice-recognition butler apps also benefit from access to an abundance of data online and application programming interfaces that let services connect to each other. Gilbert believes, however, that software could still get much more sophisticated about interpreting users’ intentions. He’s excited about the surge of smart phone applications, because they promise to provide much more information about how users want to interact with personal assistants. That could fuel further advances in machine learning and natural language processing, making future applications even smarter and easier to use.
