Natural language processing is part of Microsoft’s overall goal of making computers, well, more like people. Machines of the future should be able to respond not only to spoken commands but to gestures and facial expressions. Indeed, it’s by fusing advances from a variety of research areas that managers expect to realize the full potential of their interdisciplinary lab. That hope is embodied in “flow.”
The idea is that computing is undergoing a fundamental transformation. Having started out as a behemoth calculator, then evolving into an office productivity tool, argues Jim Kajiya, the computer is now becoming “primarily a medium for information flow.” The signs are in the ether: e-mail, computer-driven videoconferencing, Web surfing. These things go beyond document preparation to communication and learning about the world.
At Microsoft Research, two fields, vision and graphics, are converging rapidly to address the concept of flow. The basic idea behind the vision part of the challenge is to give computers the power to interpret and extract information from visual cues such as stored images or live camera feeds. Roughly half of Microsoft’s 14-member vision group is working mainly on constructing fresh viewpoints of a given scene from just a few images of the setting, as shot from different angles. The other half focuses on vision-based user interfaces. This involves such problems as watching for faces through a camera mounted atop the computer and determining whether there’s someone in front of the machine, where he or she is looking, and even the expression on his or her face.
Graphics, of course, is heavily involved in techniques for assembling 3-D pictures. Senior graphics researcher Brian Guenter, for instance, is building an elaborate database of faces, facial muscle movements and expressions. He hopes that this archive, combined with speech recognition, natural language processing and vision technology, will enable him to create virtual characters who communicate online via the spoken word rather than text, and whose lips and expressions move in perfect sync with their voices.
Vision and graphics technologies might be combined like this: Users select a face as their online persona. They then sit at their computers, speaking into a microphone while a camera captures their expressions. Words, grimaces and smiles would all be broken into bits, transmitted over phone lines, and reconstructed in the virtual environment so that they appeared to emanate from the onscreen character. Such technology, Guenter believes, will prove a boon to online chat sessions and game-playing, where people want a degree of anonymity but desire more nuanced interactions than plain text affords.
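The pipeline described above can be sketched in miniature. Everything here is hypothetical: the `Frame` type stands in for whatever the real speech and vision components would produce, and JSON stands in for an actual wire format. The point is only the shape of the flow, with capture and encoding on one end and reconstruction onto a chosen persona on the other:

```python
from dataclasses import dataclass
import json

# Hypothetical unit of captured data: one recognized word plus a coarse
# expression label, standing in for real audio/vision analysis output.
@dataclass
class Frame:
    word: str
    expression: str  # e.g. "smile", "grimace", "neutral"

def encode(frame: Frame) -> bytes:
    """Break a captured frame 'into bits' for transmission."""
    return json.dumps({"w": frame.word, "e": frame.expression}).encode()

def decode(payload: bytes) -> Frame:
    """Reconstruct the frame on the receiving end."""
    d = json.loads(payload.decode())
    return Frame(word=d["w"], expression=d["e"])

def render(persona: str, frame: Frame) -> str:
    """Drive the chosen onscreen persona with the reconstructed frame."""
    return f"{persona} [{frame.expression}]: {frame.word}"

# Sender side: capture and encode; the "wire" is just a list of payloads.
captured = [Frame("hello", "smile"), Frame("really?", "grimace")]
wire = [encode(f) for f in captured]

# Receiver side: decode each payload and animate the selected persona,
# so words and expressions appear to emanate from the onscreen character.
lines = [render("avatar-7", decode(p)) for p in wire]
print(lines)
```

In a real system, the interesting work lives inside the parts elided here: the vision components that produce the expression labels, and the graphics components that turn them into synchronized lip and face motion.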
The potential extends beyond the domain of online chat and game-playing. Kajiya, an Asian-American whose flowing mane and huge muttonchop sideburns make him look like a sci-fi Zen master, also speaks of video-voice connections that will bring long-distance, person-to-person communication into an entirely new dimension. The system would use programs that track gestures and eye movements and instantly redraw the screen image, so that the person on the other end sees a different representation of the same scene, one that takes into account where his colleague is pointing or looking, just as in real life.
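The redraw idea can be illustrated with a deliberately tiny sketch. The scene, the object labels, and the tracker output are all invented for illustration; the real systems Kajiya describes would operate on camera imagery, not strings. What the sketch shows is the loop: a tracker reports where the remote colleague is looking, and the local view is redrawn to emphasize that part of the scene:

```python
def render_view(scene, remote_focus):
    """Redraw the shared scene, emphasizing whatever object the remote
    participant's tracked gaze or gesture currently points at."""
    return [f"*{obj}*" if obj == remote_focus else obj for obj in scene]

# A shared scene as a flat list of named objects (hypothetical labels).
scene = ["chart", "model", "whiteboard"]

# The tracker reports that the colleague is looking at the whiteboard;
# the local screen is immediately redrawn with that object emphasized.
view = render_view(scene, "whiteboard")
print(view)  # ['chart', 'model', '*whiteboard*']
```

Each participant would run this loop continuously against the other's tracking data, so both views stay in step with where attention actually is.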