Sound and Vision
In the future, members of Project Oxygen say, computing power will cost next to nothing. That means that computation-heavy technologies, such as vision systems and software that understands spoken requests, will be able to replace standard mouse-and-keyboard interfaces. “We have to extend the modality beyond pointing and clicking,” says Victor Zue, ScD ’76, codirector of the lab and, along with Anant Agarwal and Rodney Brooks, one of the leaders of Project Oxygen. Instead of being tethered to a desktop and other stand-alone devices, people should be able to interact with computers easily and naturally, from a distance, through conversation or gesture.
As a first step, principal research scientist James Glass, SM ’85, PhD ’88, is creating language-processing systems that go beyond simple speech recognition and “track some sort of meaning, to understand the content and context of the conversation,” he says. His group created a system that allows someone to inquire over the phone about restaurants in the Boston area. The system analyzes each sentence using grammatical rules to figure out what information the caller needs, then searches a database that includes information about local restaurants: their locations, phone numbers, types of cuisine, and price ranges. Since this database is constantly changing, Glass says, it’s difficult for the program to learn every restaurant’s name. So instead, it assumes that unknown words are probably restaurant names and searches the database for likely matches. Then the system reprocesses the question and finds the phone number in a matter of seconds.
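The fallback Glass describes, treating out-of-vocabulary words as a candidate restaurant name and fuzzy-matching it against the database, can be sketched roughly as follows. This is an illustration only: the restaurant names, vocabulary list, and matching threshold are invented, not taken from the lab's actual system.

```python
import difflib

# Hypothetical mini-database of Boston-area restaurants (entries are invented).
RESTAURANTS = {
    "legal sea foods": {"phone": "617-555-0101", "cuisine": "seafood"},
    "mary chung": {"phone": "617-555-0102", "cuisine": "Chinese"},
    "oleana": {"phone": "617-555-0103", "cuisine": "Mediterranean"},
}

# Words the recognizer's language model already knows (a tiny stand-in vocabulary).
KNOWN_WORDS = {"what", "is", "the", "phone", "number", "of", "for", "please"}

def lookup_phone(utterance):
    """Collect the words the system does not recognize, assume they form a
    restaurant name, and fuzzy-match that guess against the database."""
    words = utterance.lower().replace("?", "").split()
    unknown = " ".join(w for w in words if w not in KNOWN_WORDS)
    if not unknown:
        return None
    # Tolerate recognition errors: accept the closest name above a similarity cutoff.
    matches = difflib.get_close_matches(unknown, RESTAURANTS, n=1, cutoff=0.6)
    return RESTAURANTS[matches[0]]["phone"] if matches else None
```

Even a slightly garbled transcription such as “mary chungs” still lands on the right entry, because the match is by string similarity rather than exact lookup.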
But speech is just one mode of communication. “One of the things about Oxygen is that it’s not trying to develop [stand-alone] technologies in networking, speech, and vision,” says Zue. “Increasingly, it’s the integration of these technologies.” Glass’s group and the vision group led by associate professor Trevor Darrell, SM ’90, PhD ’96, are collaborating on a system that combines speech and vision technologies. The system allows someone standing in front of a projected wall display to create and manipulate geometric shapes by gesturing and giving spoken commands such as “add a yellow pyramid here” or “resize this.” The system tracks the person’s movements through a stereo camera and captures his or her voice through a nearby microphone array. Although the prototype is fairly simple, Darrell imagines that future systems may be used in physical-therapy programs or video games.
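The key move in such a system is resolving deictic words like “here” and “this” against the tracked gesture position. A minimal sketch of that fusion step, with an invented command grammar and display state that are not the lab's actual design, might look like this:

```python
import math
import re

# Shared display state: each shape records its kind, color, position, and size.
shapes = []

def _nearest(point):
    # "this" refers to whichever existing shape the user is pointing closest to.
    return min(shapes, key=lambda s: math.dist(s["pos"], point))

def handle(utterance, pointed_at):
    """Fuse one spoken command with the hand position reported by the tracker."""
    m = re.match(r"add an? (\w+) (\w+) here", utterance)
    if m:
        # "here" resolves to the tracked pointing location on the wall display.
        shapes.append({"color": m.group(1), "kind": m.group(2),
                       "pos": pointed_at, "size": 1.0})
    elif utterance.startswith("resize this"):
        _nearest(pointed_at)["size"] *= 2.0
```

Neither channel suffices alone: the speech stream says what to do, and the vision stream says where, which is why the two groups' technologies have to be integrated rather than run side by side.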
In some cases, people won’t need to give commands because computers embedded in their offices will anticipate their needs. The groups headed by Shrobe and Darrell have developed prototype offices that can learn their occupants’ patterns of behavior. Stereo cameras first track how a subject uses the space. Once the system understands how people’s locations correspond to their needs, computers, lights, and even radios can react to their movements. “A normal computer is blind to whether I’m sitting in front of it, sitting on the couch, or off in the kitchen making coffee,” says Darrell. But a vision-enabled room could direct a cell-phone call to voice mail if it recognized that the recipient was sitting at a table with three other people and, therefore, likely having a meeting.
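Darrell's cell-phone example boils down to a rule that maps what the cameras see to an action. A toy version, with the "meeting" condition and threshold chosen purely for illustration, could read:

```python
def route_call(recipient_location, companions_at_table):
    """Route an incoming call based on what the room's vision system observes.
    The rule below is an invented example, not the prototype's actual policy."""
    # Someone seated at a table with several others is probably in a meeting.
    in_meeting = recipient_location == "table" and companions_at_table >= 2
    return "voicemail" if in_meeting else "ring"
```

In the learned version the article describes, such rules would not be hand-written; the system would infer from observed behavior which locations and groupings correspond to interruptible moments.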