Janet MacIver and Jim Baker fell in love when they were both graduate students at New York City’s Rockefeller University. It was the fall of 1970. Janet, a personable and outgoing biophysicist, was studying how information is processed by the nervous system. Jim was an intensely shy mathematician looking for a promising thesis topic.
The third participant in their relationship, the riddle of speech recognition, entered the scene one day when Jim visited Janet’s lab and saw an oscilloscope screen displaying a moving wavy line. The signal, Janet explained, was a “continuous log of ongoing events” produced by a type of small analog circuit originally invented by Professor Jerome Lettvin at MIT. The “events” on her screen were the sounds of human speech.
“It struck me as a very interesting pattern recognition problem,” Jim says, thinking back on that fateful squiggle. Routed to a speaker, the signal would produce sounds a person could understand: language, in short. But displayed on the screen, the information was impenetrable.
“And as I learned more about it, I learned how difficult the problem really was,” he recalls. The key challenge wasn’t simply building a computer that could identify individual words; a team at Bell Labs had done that back in 1952. Bell’s simple computer could recognize the digits “zero” through “nine” by matching the spoken sounds against a set of patterns stored in analog memory. And by the 1970s, the vocabularies of such “discrete” recognition systems, which worked provided that the system was first trained on the speaker’s voice and that the speaker paused between words, had grown to a few hundred words.
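The core of that early approach, matching an utterance against stored reference patterns, can be illustrated with a toy sketch. The feature vectors and labels below are invented for illustration; the 1952 Bell system did its matching in analog circuitry, not software.

```python
# Toy sketch of template matching: classify an utterance by finding the
# stored reference pattern closest to it (nearest-neighbor by Euclidean
# distance). All numbers here are hypothetical.
import math

def closest_template(features, templates):
    """Return the label of the stored template nearest to `features`."""
    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(templates, key=lambda label: distance(features, templates[label]))

# Hypothetical stored patterns for two spoken digits, one vector per word
templates = {"zero": [0.9, 0.1, 0.3], "one": [0.2, 0.8, 0.5]}
print(closest_template([0.85, 0.15, 0.35], templates))  # → zero
```

The scheme’s limits are visible even in the sketch: it needs one clean template per word per speaker, which is why such systems required training on each voice and pauses between words.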
The real task was to design an algorithm that could make sense of naturally spoken sentences, where individual word sounds are camouflaged by their context (see diagram p. 61). “That [made] it more interesting,” Jim says. Even then, continuous speech recognition struck him as an ideal research problem, one he characterizes as “very difficult but not impossible.”
As Jim and Janet prepared for their wedding in 1971, the U.S. Defense Advanced Research Projects Agency (DARPA) kicked off an ambitious five-year project called Speech Understanding Research. The agency felt that any technology that let soldiers communicate faster with computers could be a significant strategic advantage, especially on the battlefield. The project’s goal: a system that could recognize continuous human speech from a 1,000-word vocabulary with 90 percent accuracy.
The timing of the DARPA initiative was fortuitous for the Bakers, as was Jim’s scientific background. As an undergraduate, he had developed a mathematical technique for analyzing apparently random events, based on methods pioneered by the Russian mathematician Andrey Markov (1856-1922). Jim was the first person to realize that such “Hidden Markov Models” might be used to untangle the speech riddle.
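The appeal of a hidden Markov model for speech is that the states a listener cares about (roughly, the sounds being produced) are never observed directly; only noisy acoustic evidence is. A standard way to recover the most likely state sequence is the Viterbi algorithm. The sketch below is a generic textbook illustration with made-up states, observations, and probabilities, not a reconstruction of Jim Baker’s actual models.

```python
# Minimal Viterbi decoding for a toy hidden Markov model. States and
# probabilities are hypothetical; real speech systems use many states
# per phoneme and probabilities learned from data.

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state path for an observation sequence."""
    # V[t][s] = probability of the best path ending in state s at time t
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prob, prev = max(
                (V[t - 1][p] * trans_p[p][s] * emit_p[s][obs[t]], p)
                for p in states
            )
            V[t][s] = prob
            back[t][s] = prev
    # Trace back from the best final state to recover the full path
    last = max(V[-1], key=V[-1].get)
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        last = back[t][last]
        path.insert(0, last)
    return path

# Toy example: infer hidden "sound" states from observed acoustic frames
states = ["S1", "S2"]
start_p = {"S1": 0.6, "S2": 0.4}
trans_p = {"S1": {"S1": 0.7, "S2": 0.3}, "S2": {"S1": 0.4, "S2": 0.6}}
emit_p = {"S1": {"a": 0.5, "b": 0.5}, "S2": {"a": 0.1, "b": 0.9}}
print(viterbi(["a", "a", "b"], states, start_p, trans_p, emit_p))
# → ['S1', 'S1', 'S1']
```

The decisive property for speech is that the decoder weighs every possible state sequence at once, so a word sound blurred by its neighbors can still win out if the surrounding context makes it probable.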
Most newlyweds collaborate on challenges such as choosing a pattern for their wedding china. The Bakers didn’t skip these tasks (they chose a dragon), but then decided to tackle the problem of speech recognition together as well. Yet they found themselves increasingly isolated at Rockefeller, which had no experts in speech understanding and lacked the computing power to try out Jim’s techniques. So the next year, they packed their bags and transferred to Carnegie Mellon University, one of the DARPA project’s primary contractors and a hotbed of artificial intelligence (AI) research.