The Grammar of Sound

New software lets you index and search audio much faster than in the past.

John Harney

April 30, 2003

Imagine you were a transcriptionist at the federal government’s trial of Microsoft last year. Say you were trying to find every instance in which Bill Gates testified between May 15 and June 1. Using existing tools like full-text search engines, natural language queries, or speech recognition, you’d have to transcribe the audio into a text file, then index it with a lexicon of terms that included “Gates.” Such an undertaking would have been labor-intensive, time-consuming, and error-prone. But only then could congressmen quickly locate the testimony in which they were interested.

The key to expediting the process is eliminating the need for transcription, indexing, or both. This has long appeared to be an insoluble problem. But a company called Fast-Talk Communications, which spun out of Georgia Tech, has created a way for users to locate subject matter in an actual audio file simply by phonetically spelling and entering any term they want to find.

Say, for example, that you want to locate the word “Sudetenland” in an audio account of events leading up to World War II. According to Mark Clements, co-founder of the Atlanta-based company, you’d simply “sound out what Sudetenland sounds like. Take the name, ‘Sue,’ the city, ‘Dayton,’ and the word, ‘land,’ and string those together, type it in. That gets resolved into the set of phonemes you’re looking for” (phonemes are the basic units of sound from which the spoken words of a language are built). The Fast-Talk software finds the string of phonemes that corresponds to the letters you enter and guides you to all spoken references to Sudetenland in the audio file. Because this tool bypasses the whole transcription and indexing process, it delivers results fast. According to Clements, the system processes “on the order of 30 hours of material per second.”
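To make the idea concrete, here is a minimal sketch in Python of the two steps Clements describes: resolving a sounded-out query into phonemes, then scanning for that run of phonemes. Everything in it is an assumption made for illustration. The pronunciation table, the function names, and the timestamped phoneme track are hypothetical, and the exact matching used here is a simplification, not Fast-Talk’s actual algorithm.

```python
# Toy illustration only: the pronunciation table and the "audio" track are
# hypothetical stand-ins, not Fast-Talk's actual data or algorithm.

# Hypothetical sounded-out pieces mapped to ARPAbet-style phoneme symbols.
PRONUNCIATIONS = {
    "sue": ["S", "UW"],
    "dayton": ["D", "EY", "T", "AH", "N"],
    "land": ["L", "AE", "N", "D"],
}


def query_to_phonemes(query):
    """Resolve a sounded-out query such as 'Sue Dayton land' into phonemes."""
    phonemes = []
    for piece in query.lower().split():
        phonemes.extend(PRONUNCIATIONS[piece])
    return phonemes


def find_matches(track, target):
    """Return the start time of every place the target phoneme run occurs.

    `track` is a list of (start_seconds, phoneme) pairs, assumed to have been
    produced ahead of time by some phoneme-level preprocessing of the audio.
    """
    symbols = [phoneme for _, phoneme in track]
    n = len(target)
    return [
        track[i][0]
        for i in range(len(symbols) - n + 1)
        if symbols[i:i + n] == target
    ]


# A made-up phoneme track for the spoken phrase "the Sudetenland".
TRACK = [
    (0.0, "DH"), (0.1, "AH"),
    (0.2, "S"), (0.3, "UW"), (0.4, "D"), (0.5, "EY"), (0.6, "T"),
    (0.7, "AH"), (0.8, "N"), (0.9, "L"), (1.0, "AE"), (1.1, "N"), (1.2, "D"),
]

hits = find_matches(TRACK, query_to_phonemes("Sue Dayton land"))
print(hits)  # start times, in seconds, of each match
```

A production system would presumably need approximate rather than exact matching, since real speech rarely yields a clean phoneme sequence. The sketch only shows why no transcript or word lexicon is required: the search runs entirely over sounds.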

This is important, says Dan Rasmus, an analyst at the market research firm Giga/Forrester, because “voice is one of those untapped resources that companies have.” Jackie Fenn, who follows emerging technologies at Gartner, contends that Fast-Talk’s “main value is in tapping into audio streams that you probably wouldn’t really be able to get access to” otherwise. “It’s not cost-effective to have a human do that,” Fenn says.

Speed is not the only advantage that the Fast-Talk method has over alternative methods. With speech recognition, says Clements, “your goal is to take audio that’s input speech and either recognize what was spoken according to some very constrained grammar, or you can use a natural language approach and try to find the sequence of words that is most probable.” If the term is not in a system’s spoken lexicon, however, or if you are uncertain about the sequence of words in text form you’re looking for, the search may prove fruitless. By contrast, Clements says, the Fast-Talk approach “processes the speech in such a way that you can later go back and search it very efficiently for any set of sounds; they don’t have to have any lexical existence at all.”

Fenn believes this is a much more versatile approach. Fast-Talk, she says, is “focusing on the pure audio aspect” of spoken communications, which she says “seems to give them a faster algorithm and greater flexibility in dealing with new items that aren’t in the vocabulary.” Fast-Talk also handles accents well. The key is training the software to recognize phonemes in the accent with which you’ll be dealing. For instance, a system trained on a speaker from Canada would match the sound “hoos” to the word “house.”

Fast-Talk’s software does not, however, employ certain strategies that natural language queries might, which leads to some shortcomings. Proximity searches (for instance, where a natural language tool would recognize a word like “Georgia” because it usually occurs right after the word “Atlanta”) are not possible. Also, since Fast-Talk recognizes by sound, not spelling, it cannot distinguish homonyms. “We would not be able to tell the difference between the word, ‘discreet’ as in cautious versus ‘discrete’ as in individual items,” says Clements. Another disadvantage, cautions Fenn, is that the system is not amenable to text mining. “If you wanted to look for patterns and clusters of related concepts, you couldn’t. You’d have to have a transcript.”
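The homonym limitation follows directly from searching by sound. A short Python sketch shows why, using a hypothetical two-entry pronunciation table written in ARPAbet-style symbols with stress marks omitted. The table and the function name are illustrative assumptions, not part of Fast-Talk’s software.

```python
# Hypothetical two-entry pronunciation table (ARPAbet-style, stress omitted).
PHONEMES = {
    "discreet": ("D", "IH", "S", "K", "R", "IY", "T"),
    "discrete": ("D", "IH", "S", "K", "R", "IY", "T"),
}


def same_sound(a, b):
    """Return True when two spellings resolve to the same phoneme sequence."""
    return PHONEMES[a] == PHONEMES[b]


# Both spellings collapse to one phoneme sequence, so a search that sees
# only phonemes has nothing left to tell them apart.
print(same_sound("discreet", "discrete"))
```

Once both spellings reduce to the same sequence, telling them apart requires context from the surrounding words, which is the kind of information a transcript would supply.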

Those qualifications aside, however, the technology appears to offer significant benefits in several applications. Television and radio networks have thousands of hours of programming but no fast way to index and navigate them. “If you want to find where, say, an NPR news account talked about a panda,” says Clements, “it just takes forever to do that right now.” Another potentially hot use is in call centers. “Call centers want to know if anyone had a conversation about X kind of product,” Rasmus says. “Looking at those voice recordings as a means for getting information to somebody at a call center who’s trying to help a client is incredibly time saving.” Office workers might ultimately find the technology useful too, Clements says. “Imagine that you have all your voicemail as audio files interleaved with all of your e-mail,” he says. “Our tool would make it so you could manage it.”
