Computers That Speak Your Language

Voice recognition that finally holds up its end of a conversation is revolutionizing customer service. Now the goal is to make natural language the way to find any type of information, anywhere.

Wade Roush

June 1, 2003

“IwanttoflyfromBostontoMilwaukeenext
Saturdayformysisters
birthdayandIdontwanttostopin
ChicagoandIdontwantto
paymorethanfourhundreddollars
andthepartystartsatthreeoclocksoI
needtogettherebeforethen.”

Say that to a human airline agent nicely, and he or she will quickly disentangle your words and find flights that meet your criteria. Say it to the airline’s automated reservations line, however, and all you’re likely to get is a cheery digital voice intoning, “Sorry, I didn’t catch that.”

Don’t blame the voice. Even assuming the airline’s computers overcame the garbled words, background noise, and Boston accent to render the request into accurate text, no language-processing system has the computational firepower to make sense of your price and routing constraints, ignore irrelevancies like the fact that Saturday is your sister’s birthday, and understand that if the party starts at 3:00 p.m., you’re not interested in flights that arrive in Milwaukee at 4:00.

If computers could understand and respond to such routine natural-language requests, the results would be win-win: airlines wouldn’t need to hire so many agents, and consumers wouldn’t have to struggle with the confusion of touch-tone interfaces that leave them furiously tapping the “0” button, vainly trying to reach a live operator.

Futurists have been envisioning such a world since at least 1968, when 2001: A Space Odyssey’s HAL 9000 became the archetypal voice-interactive computer. Academic and corporate researchers intrigued by the sheer coolness of the idea have been tinkering for just as long with systems for recognizing and responding to human speech. But technologies don’t take hold because they’re cool: they need a business imperative. For language processing, it’s the enormous expense of live customer service that’s finally driving the technologies out of the lab. Simple “press or say ‘one’” phone trees are rapidly heading for the scrap heap as companies such as Nuance Communications and SpeechWorks meld previously competing strategies into software that infers the intention behind people’s naturally spoken or written requests. Major airlines, banks, and consumer-goods companies are already using the systems, and while the technology can’t yet hold up its end of a conversation, it does help callers with simple questions avoid long queues, and it frees human agents to deal with more complex requests.

Such improvements have set up natural-language systems for explosive growth: 43 percent of North American companies have either purchased interactive voice response software for their call centers or are conducting pilot studies, according to Forrester Research, a technology analysis firm. As more companies replace their old touch-tone phone menus, today’s $500 million market for telephone-based speech applications will grow, reaching $3.5 billion by 2007, according to Steve McClure, a vice president in the software research group at market analysis firm IDC. In late 2002, for example, Bell Canada installed a $4.5 million voice response system built by Menlo Park, CA-based Nuance. “Based on the results we’re seeing, the actual return on investment will take only about 10 months,” says Belinda Banks, Bell Canada’s associate director of customer care. Overall, the company expects to save $5.3 million in customer service costs this year alone.
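The Bell Canada figures can be checked with a little arithmetic. The sketch below assumes, for illustration only, that the $5.3 million in annual savings accrues evenly month by month; the article does not say how the savings are distributed, and the variable names are hypothetical.

```python
# Figures quoted above, in millions of U.S. dollars.
system_cost = 4.5
annual_savings = 5.3

# Assumption: savings arrive in equal monthly installments.
monthly_savings = annual_savings / 12

# Months needed for cumulative savings to cover the purchase price.
payback_months = system_cost / monthly_savings

print(f"Estimated payback period: {payback_months:.1f} months")
```

Under that assumption the estimate lands close to the “about 10 months” that Banks cites.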

And this is only phase one in the deployment of language-processing systems. Companies like Nuance and Boston’s SpeechWorks, the two market leaders in interactive voice response systems, are succeeding partly because they’ve tailored their technologies for narrow domains, such as travel information, where the vocabularies and concepts they must master are restricted. Even as such systems take over the customer service niche, other companies are still pursuing the challenge of true natural-language understanding. If research efforts at IBM and the Palo Alto Research Center (PARC), for example, bear fruit, computers may soon be able to interpret almost any conversation, or to retrieve almost any information a Web user wants, even if it’s locked away in a video file or a foreign language, opening markets wherever people seek knowledge via computer networks. Predicts IDC’s McClure, “Whereas the GUI [graphical user interface] was the interface for the 1990s, the NUI, or ‘natural’ user interface, will be the interface for this decade.”

Say What?

Building a truly interactive customer service system like Nuance’s requires solutions to each of the major challenges in natural-language processing: accurately transforming human speech into machine-readable text; analyzing the text’s vocabulary and structure to extract meaning; generating a sensible response; and replying in a human-sounding voice.

Scientists at MIT, Carnegie Mellon University, and other universities, as well as researchers at companies like IBM, AT&T, and the Stanford Research Institute (now SRI International), have struggled for decades with the first part of the problem: turning the spoken word into something computers can work with. The first practical products came in the early 1990s in the form of consumer speech recognition programs, such as IBM’s Voice Type, that took dictation but forced users to pause after each word, limiting adoption. By the mid-1990s, the technology had advanced and led to dictation systems such as Dragon Systems’ NaturallySpeaking and IBM’s ViaVoice, which can transcribe unbroken speech with up to 99 percent accuracy.

Around the same time, a few scientists broke away from academic and corporate labs to create startups aimed at tackling the even more complex problems (and bigger potential markets) of the second area of language processing, dubbed “language understanding.” It’s largely advances in this area that have positioned the field for its real growth spurt. These advances rest on two important realizations, according to SpeechWorks chief technology officer Michael Phillips, a former research scientist at MIT’s Laboratory for Computer Science. The first was that there’s little point in reaching for the moon: the decades-old dream of systems capable of HAL-like general conversation. “There is a myth that people want to talk to machines the same way they talk to people,” Phillips says. “People want an efficient, friendly, helpful machine, not something that’s trying to trick them into thinking they’re having a conversation with a human.” This assumption vastly simplifies the job of building and training a natural-language system.

The second realization was that the time had come to combine philosophies long held by rival factions in the language-processing community. One philosophy essentially says that understanding speech is a matter of discerning its grammatical structure, while the other holds that statistical analysis (matching words or phrases against a historical database of speech examples) is a more efficient tool for guessing a sentence’s meaning. Hybrid systems that use both methods, the startups have learned, are more accurate than either approach on its own.

But this insight didn’t arrive overnight. At MIT, Phillips had helped develop experimental software that could recognize speech and, based on its understanding of grammar, make sense of a request and reply logically. Like other grammar-based systems, it broke a sentence into its syntactic components, such as subject, verb, and object. The system then arranged these components into treelike diagrams that represented a sentence’s semantic content, or internal logic: who did what to whom, and when. The software was limited to helping users navigate around Cambridge, MA, Phillips explains. “You’d say, ‘Where’s the nearest restaurant?’ and it would say, ‘What kind of restaurant do you want?’ You would say, ‘Chinese,’ and it would find you a place.”

Shortly after Phillips licensed the technology from MIT in 1994 and left to start SpeechWorks, both he and researchers at competitor Nuance saw that one of their target applications, call steering, required something more. “There are companies out there that have 300 different 800 numbers,” Phillips explains. “The customer doesn’t understand the structure of the organization; they just know what problem they have. The right thing to do is to ask a question, like, ‘What’s the problem you’re having?’” But compared to a request for a nearby Chinese restaurant, such questions are perilously open ended.

The problem gets harder when one considers that the ambiguity of much human speech (think of a phrase like “he saw the girl with the telescope”) means that many requests are open to multiple interpretations. “There are so many different ways that somebody could speak to the system that trying to cover all that in grammars is prohibitive,” says John Shea, vice president for marketing and product management at Nuance.

SpeechWorks finally found a workable solution in 2000, when it married the MIT software with a statistical language-processing technology developed at AT&T Labs-Research in Florham Park, NJ. AT&T’s system is built around a database of common sentence fragments drawn from tens of thousands of recorded telephone calls involving both human-to-human and human-to-machine communication. Each fragment in the database is scored for its statistical association with a certain topic and classified accordingly. A fragment such as “calls I didn’t make,” for instance, might correlate strongly with the topic “unrecognized-number billing inquiries,” and the system would route the call to an agent who could credit the caller’s account. If the system isn’t confident about its choice, it prompts the caller for more information using speech synthesis technology. In the end, according to AT&T, the system correctly routes more than 90 percent of calls, a far higher success rate than callers experience when navigating old-fashioned phone trees on their own.
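The fragment-scoring idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not AT&T’s implementation: the fragment table, the scores, the extra topic names, the additive scoring rule, and the 0.6 confidence cutoff are all invented for the example. A production system would estimate such scores from thousands of transcribed calls.

```python
from collections import defaultdict

# Hypothetical fragment-to-topic association scores (invented values).
FRAGMENT_SCORES = {
    "calls i didn't make": {"unrecognized-number billing inquiries": 0.9},
    "my bill": {
        "unrecognized-number billing inquiries": 0.3,
        "payment arrangements": 0.3,
    },
    "pay": {"payment arrangements": 0.5},
    "new line": {"new service orders": 0.8},
}

# Assumed cutoff: below this total score the system asks a follow-up question.
CONFIDENCE_THRESHOLD = 0.6


def route_call(utterance):
    """Return ("route", topic) when confident, otherwise ("ask", None)."""
    text = utterance.lower()
    totals = defaultdict(float)

    # Add up the scores of every known fragment that appears in the utterance.
    for fragment, topics in FRAGMENT_SCORES.items():
        if fragment in text:
            for topic, score in topics.items():
                totals[topic] += score

    if not totals:
        return ("ask", None)

    best_topic = max(totals, key=totals.get)
    if totals[best_topic] >= CONFIDENCE_THRESHOLD:
        return ("route", best_topic)
    return ("ask", None)


print(route_call("There are calls I didn't make on my bill"))
print(route_call("I have a question about my bill"))
```

The first request matches two fragments that both point to the billing topic, so it is routed; the second matches only a weak, ambiguous fragment, so the sketch falls back to asking the caller for more information.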

Nuance developed a similar system, based on technology from SRI, which can use either grammatical or statistical methods, or both, to extract meaning from a caller’s speech. “We use different approaches depending on the customer’s needs,” says Felix Gofman, a product-marketing manager at Nuance. “You can mix and match.” In a specific field, such as banking, the topics and vocabulary of callers’ questions will be limited, and the system can operate solely using predefined lists of what customers might say. For new or wider-ranging fields such as ordering phone service, the system stores each question it hears in a database, then uses statistical techniques to compare new questions to old entries in a search for probable matches, thereby improving accuracy over time.
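To make the store-and-compare approach concrete, here is a minimal sketch. The article does not say which statistical technique Nuance uses; this example substitutes simple word-overlap (Jaccard) similarity, and the class name, method names, and sample questions are hypothetical.

```python
def tokenize(text):
    """Reduce a question to its set of lowercase words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Share of words two questions have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class QuestionMemory:
    """Stores past questions with their topics and matches new ones."""

    def __init__(self):
        self.entries = []  # list of (word set, topic) pairs

    def remember(self, question, topic):
        self.entries.append((tokenize(question), topic))

    def best_match(self, question):
        """Return (topic, similarity) of the closest stored question."""
        words = tokenize(question)
        best_topic, best_score = None, 0.0
        for stored_words, topic in self.entries:
            score = jaccard(words, stored_words)
            if score > best_score:
                best_topic, best_score = topic, score
        return best_topic, best_score


memory = QuestionMemory()
memory.remember("i want to add a second phone line", "new service orders")
memory.remember("why is my bill so high", "billing inquiries")

topic, score = memory.best_match("can i add a phone line")
print(topic, round(score, 2))
```

Each call to `remember` gives later questions one more entry to match against, which is the sense in which a system of this kind can improve with use.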

SpeechWorks’ call center technology is used by such diverse enterprises as Office Depot, the U.S. Postal Service, Thrifty Car Rental, and United Airlines. But the company pushing the technology closest to its limits is Amtrak. Travelers calling Amtrak’s automated telephone system can not only get train schedules but also book reservations and charge tickets to their credit cards. “When we set out, the primary goal was to increase customer satisfaction rates,” says Matt Hardison, the railroad’s chief of sales, distribution, and customer service. But as a bonus, he says, the savings in labor costs repaid Amtrak’s $4 million investment in the technology within 18 months.

Nuance, meanwhile, has big customers in the financial and telecommunications industries, including Schwab, Sprint PCS, and Bell Canada. British Airways told the company that after deploying Nuance speech recognition systems last year, its average cost per customer call dropped from $3.00 to $.16. And according to Bell Canada’s Banks, 40 percent of customers used to “zero out,” or request a live operator, while navigating the company’s touch-tone phone tree. Between the company’s December 2002 implementation of the system and March 2003, that number dropped to 15 percent, says Banks.

A Deeper Understanding

For all their success, however, in no sense do these systems really “understand” what they hear. They deal only with rules of grammar, probabilities, and stored examples. Indeed, they excel precisely because their makers have turned away from the quest for a system intelligent enough to read and summarize a book or sustain a general conversation.

But other researchers retain a broader view of the possibilities for natural-language processing. Like Ron Kaplan, a research fellow at PARC who developed much of the basic grammatical theory behind many of today’s natural-language systems, they are building software that can cope with a far greater variety of inputs, from newspaper stories to the disorganized mass of multimedia information on the Web. Kaplan is critical of what he calls the “shallow methods” used for niche applications like call steering. “Compared to the alternative” (maintaining a costly staff of human customer-service agents) “they are actually not bad,” he says. “But compared to what you would like, they stink.” A more effective natural-language interface, Kaplan says, would eliminate the need to carefully tailor the systems and allow users to speak or write freely.

Two problems hindering that vision, in Kaplan’s view, are that the databases of language samples upon which simpler systems draw are too small, and the statistical algorithms they use are designed to eliminate the ambiguity in much of what people say, homing in as quickly as possible on the most likely meaning. Kaplan believes that if this ambiguity is eliminated too soon, the correct meaning of an utterance, especially a long or complex sentence, may be lost. So he has spent the last decade working on a grammar-driven system, called the Xerox Linguistic Environment, which actually tries to preserve ambiguity. The system parses an utterance into every possible sentence diagram allowed under a set of 314 rules governing relationships between various parts of speech (PARC researchers assembled the rules manually over three years). A complex sentence with 40 or more words, for instance, might be interpreted in as many as 1,000 different ways.
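A toy example shows what keeping every interpretation looks like. The sketch below is not the Xerox Linguistic Environment; it is a standard chart parser (the CKY algorithm) run over a hypothetical six-rule grammar and six-word lexicon invented for illustration. It enumerates every parse of the ambiguous phrase “he saw the girl with the telescope” instead of committing to one.

```python
from collections import defaultdict

# Hypothetical toy grammar: (parent, left child, right child).
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),   # "with the telescope" modifies the seeing
    ("NP", "NP", "PP"),   # "with the telescope" modifies the girl
    ("NP", "Det", "N"),
    ("PP", "P", "NP"),
]

# Hypothetical lexicon mapping each word to its possible categories.
LEXICON = {
    "he": ["NP"], "saw": ["V"], "the": ["Det"],
    "girl": ["N"], "telescope": ["N"], "with": ["P"],
}


def all_parses(words):
    """Return every bracketed S tree that covers the whole word list."""
    n = len(words)
    # chart[(i, j)][label] holds all trees with that label over words[i:j].
    chart = defaultdict(lambda: defaultdict(list))

    for i, word in enumerate(words):
        for label in LEXICON[word]:
            chart[(i, i + 1)][label].append(f"({label} {word})")

    # Build longer spans from every pair of adjacent shorter spans.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for parent, left, right in BINARY_RULES:
                    for left_tree in chart[(i, k)][left]:
                        for right_tree in chart[(k, j)][right]:
                            chart[(i, j)][parent].append(
                                f"({parent} {left_tree} {right_tree})"
                            )
    return chart[(0, n)]["S"]


parses = all_parses("he saw the girl with the telescope".split())
for tree in parses:
    print(tree)
```

Even this seven-word phrase yields two distinct diagrams, one in which the telescope is the instrument of seeing and one in which the girl has it. With hundreds of rules and 40-word sentences, the number of candidates climbs quickly, which is consistent with the figure of up to 1,000 interpretations cited above.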

The system’s grammar analysis is so thorough that it correctly captures, on average, 75 percent of the logical relationships in a sentence, which is “actually very high compared to what most statistical methods do,” says Kaplan. That accuracy rate can be increased to about 80 percent if the software takes advantage of those statistical methods, comparing each possible interpretation to similar diagrams in a “trained” database (in the PARC software’s case, a store of hundreds of thousands of accurate diagrams of sentences drawn from Wall Street Journal articles).

Kaplan plans to first unleash the system on Xerox’s huge digital knowledge base of copier repair techniques, which is constantly consulted and updated by the company’s field technicians. There it will compare thousands of individual entries in order to weed out redundancies and contradictions. “It could be that a lot of technicians have discovered the same solution to a common problem,” such as replacing a copier’s drum, Kaplan explains. “You get a bunch of entries saying the same thing, only in different ways.” Finding and pruning out such redundancy automatically, he adds, can help technicians spend less time sorting through options. The software could also eventually become the core of an advanced system for translating documents into different languages, a task particularly plagued by ambiguity (see “The Translation Challenge”).

Before a computer can understand or translate stored information expressed in natural language, however, it has to find it. That’s getting more difficult as the digital universe expands, which is why IBM is pursuing an ambitious project to employ natural-language processing in the management of “unstructured information,” the mass of digital text, images, video, and audio stored on computer networks. Much of IBM’s business rests on its database product, DB2, but a traditional database can only retrieve information that has already been organized and indexed. IBM wants to give business users and consumers immediate access to the unindexed data languishing on millions of hard drives around the world, effectively extending its dominance in structured-data management into the realm of unstructured information. To get there, the company is pursuing an initiative designed to merge different language-processing approaches into powerful software that can intelligently search, organize, and translate all this data. The project, called the Unstructured Information Management Architecture, could fuel the company’s business well into the Internet age. “As research bets go, this is a big one,” says Alfred Spector, the division’s senior vice president.

Translation software and other products that use the new architecture are still in the prototype stage. But ultimately, says David Ferrucci, the project’s lead software architect, the architecture will help IBM build systems that pluck the latest information a user wants from any digital source, in any language, and deliver it in organized form. Already, U.S. companies spend $900 million a year on “enterprise information portals” that help employees find the records they need, according to Giga Information Group in Cambridge, MA, and the opportunities for IBM and other companies developing software for managing unstructured information will only multiply as that information accumulates. “There is now clearly a business rationale to deal with unstructured data,” concludes Spector.

If efforts to cope with ambiguity, unstructured information, and other complexities of language succeed, we might ultimately stop treating computers like toddlers, simplifying everything we say to fit their immature understanding of the world. When that day comes, and it could come soon, consumers can expect to find automated voice interfaces at every turn, allowing them to use plain English (or French or Chinese) to interact with everything from Web archives to appliances and automobiles.

And that would really be something to talk about.

Language Processing’s Babel

COMPANY | TECHNOLOGY | LOCATION
AT&T | Automated speech recognition; natural-sounding speech synthesis | New York, NY
Banter | Automated e-mail classification and response | San Francisco, CA, and Jerusalem, Israel
IBM | Automated speech recognition; translation; standard architectures for managing unstructured information | Armonk, NY
Intel | Audiovisual speech recognition | Santa Clara, CA
Inxight | Software for discovering, exploring, and categorizing text data on corporate networks | Sunnyvale, CA
iPhrase Technologies | Natural-language text searching of corporate Web sites | Cambridge, MA
Microsoft | Grammar checking; query interfaces; translation | Redmond, WA
Nuance Communications | Interactive voice response systems for telephone-based customer service | Menlo Park, CA
Palo Alto Research Center | Improved algorithms for extracting meaning from written text | Palo Alto, CA
SpeechWorks | Interactive voice response systems for telephone-based customer service | Boston, MA
StreamSage | Natural-language search and indexing of video and audio material | Washington, DC
