Computing Gets Physical

Gadgets that let you control computers with a wave or a nod could offer an escape from keyboards and mice.

David Kushner

July 1, 2004

For once, I control the weather.

I’m standing in front of a green backdrop inside a windowless studio at Cybernet Systems, a technology research and development company in Ann Arbor, MI. A digital camera in front of me is beaming my image, real time, to a television monitor that shows a scene typical of a nightly news weather report. There I am, standing before a map of the Midwest. I extend my arm and begin twirling my hand over the blip of Detroit. The map behind me zooms in on the area beneath my palm. The city widens into view and comes into focus. Looks like it’s going to be a wet one, folks.

This is GestureStorm, a software system Cybernet developed to let weather broadcasters run through their forecasts with simple flicks of the hand. No wires. No buttons. No geeky audiovisual control panels. Move a hand one way, and you paint raindrops on-screen. Move it another, and you stir up a tornado. The interface is completely a matter of gesture. And if a lot of people have their way, this is only the beginning. Gesture recognition technology aims to become this millennium’s remote control: a fluid, freeing means of interacting with all the digital stuff around us. Think Minority Report. In that film, Tom Cruise stands before a futuristic digital display, pointing and waving his way through a cascade of images and documents. This stuff, once the domain of science fiction, is finally creeping into the real world.

In Orlando, FL, WKMG became the first television station to use GestureStorm when it unveiled the system in December. In July 2003, Sony Computer Entertainment released the EyeToy, a PlayStation 2 peripheral that, using special software and an inexpensive digital camera, can project a video feed of a player into a game, even responding to the player’s movements; instead of zapping a bad guy with a controller button, the gamer gives him a swift karate chop. This year, two companies will debut virtual keyboards that let people control personal digital assistants and even automotive equipment with gestures. As far as Charles Cohen, vice president for research and development at Cybernet, is concerned, gesture recognition’s time has come. “Gesture recognition is remote control with a wave of a hand,” he says.

As I unleash some storm clouds over Detroit, I see what he means. Of course, playing weatherman is one thing, but importing gesture recognition into daily life is another, as Cohen and the others pioneering the technology are learning. “I don’t know what the killer app for gesture recognition is yet,” Cohen confesses.

Get in the Game

On the frontier of gesture recognition technology, there may be no better judge of a killer app than a four-year-old. That’s who I enlisted when I first hooked my PlayStation 2 up to an EyeToy, an unassuming device that’s shaping up to be the Pong of gesture interfaces. Intuitive, fun, and physical, it embodies the promise of gesture recognition wares. That promise is freedom: freedom from 14-button controllers, keyboards, mice, cables. “Everyone agrees that the keyboard isn’t necessarily the most optimal way to interface,” says senior analyst Joe Laszlo of Jupiter Research, a technology research firm based in New York City.

The EyeToy might be the first gesture recognition device to deliver a viable alternative to a keyboard or game controller. The hardware is a black-ribbed, rectangular digital camera about the size of a deck of cards. It plugs into the USB port on the front of the PlayStation. For about $50, you get the camera plus a CD of 12 games. Once the device is connected, you put it on top of your television set and angle it forward. The outline of a human body appears in the center of the screen, and you position yourself before the camera so that you fill it in.

“Four-year-old, come hither!” I say, helping my daughter to stand in the middle of the outlined shape. She waves at herself cautiously. “Where’s the game?” she asks. “You’re in it!” I respond. In front of her image on-screen floats a swarm of multicolored discs. To make a selection, she must wave her hand over the disc representing the game she’d like to play. The games are simple, almost like 21st-century versions of the old Atari classics Tennis and Combat. There’s a boxing game, a juggling game, a dancing game.

My daughter likes the sound of Wishi Washi, so Wishi Washi it is. Foamy bubbles cover the screen in front of her image. The object of the game is to “wipe” the screen clean to the strains of Dixieland jazz. Hesitantly at first, she waves her arm as if she were making a snow angel, and a corresponding blob of bubbles on-screen vanishes. The camera is capturing her movements, real time; before long she realizes she can use more than just her hands. By the end, she is jumping, leaning, kicking, flapping, using every physical motion she can muster to wipe away the foam. It’s not often you see a gamer sweat.

And plenty of gamers are sweating over the EyeToy. In the gaming business, sales of 500,000 constitute a hit. As of March, more than 500,000 EyeToys had been sold in the United States and more than two million in Europe.

Video from the tiny EyeToy camera is compressed and fed through the USB port. Once inside the PlayStation, the video is processed through “conceptual subtraction,” which compares the images in successive frames. The entire transaction uses less than 10 percent of the PlayStation 2’s processing power, leaving more than 90 percent to render the explosions, foam baths, and other graphics features of the games themselves.
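To make the idea of comparing successive frames concrete, here is a minimal sketch of simple frame differencing. It is an illustration, not Sony’s code: the grayscale nested-list frames, the function names, and the brightness threshold of 30 are all assumptions chosen for the example.

```python
# Hypothetical sketch of frame differencing; not the EyeToy's actual software.
from typing import List

Frame = List[List[int]]  # grayscale image: rows of 0-255 brightness values


def motion_mask(prev: Frame, curr: Frame, threshold: int = 30) -> List[List[bool]]:
    """Mark each pixel whose brightness changed by more than `threshold`."""
    return [
        [abs(c - p) > threshold for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev, curr)
    ]


def motion_in_region(mask: List[List[bool]], top: int, left: int,
                     bottom: int, right: int) -> float:
    """Fraction of pixels that changed inside a rectangular screen region."""
    cells = [mask[r][c] for r in range(top, bottom) for c in range(left, right)]
    return sum(cells) / len(cells) if cells else 0.0


# Two tiny 4x4 frames: a bright "hand" appears in the top-left corner.
previous = [[10] * 4 for _ in range(4)]
current = [row[:] for row in previous]
current[0][0] = current[0][1] = current[1][0] = current[1][1] = 200

mask = motion_mask(previous, current)
print(motion_in_region(mask, 0, 0, 2, 2))  # 1.0: every pixel in that corner changed
print(motion_in_region(mask, 2, 2, 4, 4))  # 0.0: nothing moved in the opposite corner
```

A game could treat a high fraction of changed pixels over an on-screen object, such as a patch of foam, as a “touch.” Nothing here identifies a hand; it only notices that something changed.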

In its current iteration, the EyeToy is limited to motion detection, but later versions will include more advanced features. Sony has already developed EyeToy software that can, for example, track different colors in an environment, even different faces. And it can deliver the sort of gesture recognition capabilities that would make, say, a Harry Potter video game truly come to life: draw a triangle with your wand and unleash a firestorm on-screen; draw a circle and turn your enemy into snow. “You’ll be able to cast a different kind of magic spell according to the shapes you draw in the air,” says Richard Marks, special-projects manager for research and development at Sony Computer Entertainment of America. EyeToy is his brainchild. Marks began working in the field of “computer vision” (technology that enables computers to perceive their surroundings) while developing cameras for underwater robots at the Monterey Bay Aquarium Research Institute in Moss Landing, CA. “I thought the PlayStation 2 would be good at computer vision,” he says.

But there are shortcomings. The USB port’s limited data-handling capacity results in fuzzy video and makes multiplayer online EyeToy experiences impossible. And the software can have difficulty discerning a player’s movements in a bright and busy environment, such as a typical family room. Marks says these problems will fade when the PlayStation 3 hits shelves, probably some time in 2006. The next-generation console will include a USB 2.0 port, as much as 40 times faster than a USB 1 port, which will reduce the fuzziness. Recognizing gestures against bright and busy backgrounds might require gamers to wave hot-pink wands or slip on gloves. Sony is distributing software tools that will help game developers exploit the new technology.

The ultimate goal is that you won’t need any prop at all. “The only thing you’ll need,” Marks says, “is your hand.”

“How Come We Never Thought of This?”

Cybernet has emerged as ground zero for the commercialization of gesture interface technology. I’ve stared at a computer screen for countless hours, but on this morning inside the company’s offices, things somehow look different. On the screen is a typical assortment of folders and program icons. When I look at the Internet Explorer icon in the upper left-hand corner, however, something strange happens. The cursor moves toward where I’m looking. No mouse. No keyboards. My hands are resting at my sides. It’s like a Ouija board.

I’m using Navigaze, a new interface based entirely on eye movement. Instead of double-clicking, for example, you double-blink; with Navigaze, Christopher Reeve could surf the Web. Cybernet will roll out Navigaze this spring, along with an improved version of a gaming technology called Use Your Head, a system (first introduced in 2000) that lets you input directional instructions by bobbing your noggin. A camera tracks a player’s head motion, and the on-screen image changes accordingly: lean left, and your field of vision turns left; lean right, and the view shifts the other way.

Cybernet made its name in the late 1980s in force feedback, the haptic technology now available for video games as well as in the automotive and medical industries. Cohen sees gesture recognition as another field ready to bloom. “Gesture recognition is in the stage that force feedback was in ten years ago,” he says.

One of Cybernet’s earliest forays into gesture recognition came in 1998, when the U.S. Army contracted with the company to create a gesture-based computerized training system: a trainee could command a troop of simulated soldiers by making a variety of hand movements. NASA commissioned the company to create a gesture-based information kiosk for the public, but that project didn’t get far. “Students kept putting their gum on the kiosk and messing it up,” Cohen says.

So far, the closest the company has come to finding the killer app for gesture interface is a military system that enables the manipulation of images on command-and-control maps. After reading a press release about the work, a television station expressed interest in adapting the technology for its meteorologist. “It was perfect!” Cohen recalls. “How come we never thought of this?”

The TV weather application was perfect for one primary reason: its surrounding environment didn’t have to be engineered. EyeToy, by contrast, works only if you stand in a certain place relative to the camera; if someone blocks the camera’s view, everything goes haywire. Because a TV meteorologist stands in front of a consistent, unobstructed background, there would be no such disruptions to contend with.

Virtual Keyboards and Beyond

The clouds have parted. The rain has ceased. As I finish my round of GestureStorm theatrics, I decide to shoo away the clouds and let Detroit return to its peace and calm once again.

Over lunch at a nearby Italian restaurant, Cybernet’s Cohen suggests that the mission of gesture recognition is not necessarily to supplant the old keyboard and mouse but, rather, to supplement them. “I won’t say gesture recognition is the be-all and end-all,” he says.

Indeed, one intriguing application illustrates the way that gesture technology could dovetail with conventional interfaces. A device from San Jose, CA-based Canesta, due out later this year, brings gesture recognition to personal digital assistants. The device projects an image of a keyboard onto a flat surface, such as a desk, through a tiny lens inside the PDA. An infrared light beam directed at the zone just above the projected keyboard senses precisely where the user’s fingers are at any instant: the device monitors the time it takes for a pulse of infrared light to leave the emitter, bounce off the moving fingertips, and return to a sensor in the PDA. A pulse’s round-trip travel time corresponds with a specific distance, providing a 3-D map of the fingertips’ position over the keys, so whatever the user types on the virtual keyboard is captured digitally inside the PDA.
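The link between round-trip time and distance is plain arithmetic: light covers the gap twice, so the distance is the speed of light times the travel time, divided by two. The sketch below works through that relationship under idealized assumptions. The function names and the 15-centimeter example distance are hypothetical and are not drawn from Canesta’s design.

```python
# Idealized time-of-flight arithmetic; a sketch, not Canesta's implementation.
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def distance_from_round_trip(round_trip_seconds: float) -> float:
    """Distance to the reflecting object: the pulse travels out and back."""
    return SPEED_OF_LIGHT_M_PER_S * round_trip_seconds / 2.0


def round_trip_for_distance(distance_m: float) -> float:
    """Inverse relationship: how long a pulse takes to go out and come back."""
    return 2.0 * distance_m / SPEED_OF_LIGHT_M_PER_S


# Hypothetical example: a fingertip 15 centimeters from the sensor.
t = round_trip_for_distance(0.15)
print(f"round trip: {t * 1e9:.3f} nanoseconds")
print(f"recovered distance: {distance_from_round_trip(t):.3f} meters")
```

At these distances the round trip takes about a nanosecond, which hints at how fine the timing has to be to tell one key from its neighbor.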

The Canesta device operates at more than 50 frames per second, so it can keep up with even the speediest typist. Because Canesta’s technology uses infrared light to measure the distance to the object, it could potentially alleviate one of the problems facing Sony and Cybernet: how to perceive gestures against a bright or busy background. With the current configuration of the EyeToy, for example, I’d seriously mess up my daughter’s game of Wishi Washi if I passed in front of the camera’s background while she’s playing. If Canesta’s infrared light were trained on her, and her alone, the game wouldn’t register my interruption. Canesta considers the $11 billion video game industry to be a future target area and says it has talked with a number of major players in the electronic-entertainment business. Later this year, a Jerusalem, Israel, company called VKB will introduce a competing virtual keyboard that employs technology similar to Canesta’s.

Beyond keyboards, weather forecasting, and games, gesture recognition technology could transform the way people interact with computers in a variety of settings. Universities have been working on the technology for years. Researchers at the Georgia Institute of Technology, for example, have explored how gesture recognition may help reduce automobile accidents. A group led by Thad Starner has created what it calls a “gesture panel” in place of a standard dashboard control. The driver adjusts the car’s temperature or sound system volume by maneuvering her hand over a designated area, without having to take her eyes off the road.

Researchers at MIT’s Media Laboratory have studied ways in which gestures could be used to enhance various entertainment devices. A “StoryMat,” for example, could recognize and react to movements of particular toys on a child’s play mat. A “conversational humanoid” senses and responds to a person’s motions, as reported by a wearable, electromagnetic tracking device. Other projects examine the emotional messages that gestures and posture convey. Research has shown that it’s possible to program machines to discern the interest, or lack thereof, that children display when interacting with educational software, says Rosalind W. Picard, director of the lab’s affective-computing research group. A program that incorporated such inadvertent user input could respond accordingly, perhaps by switching activities when the user slumped in apparent boredom.

Not surprisingly, some effort has also gone toward endowing Microsoft products with gesture interfaces. During the 1990s, researchers at the University of Cambridge in England developed an experimental system called Jester that employed gesture recognition for surfing through Windows; it never made it out of the lab. Another truly killer application would be a gesture interface for PowerPoint, the ubiquitous presentation software. At Cybernet, Cohen is working on such an interface himself. It could require the presenter to slip on a glove that would be recognized by the computer’s eye. One can only imagine the fashion possibilities.

For now, however, there’s nothing quite as efficient and responsive as the keyboard I’m typing on at the moment. It works in any shade of light. It doesn’t get confused if my kid darts into the room. And with the help of a mouse, it lets me call up my files quicker than I can blink.

“Whenever you want to introduce a new user interface,” analyst Laszlo says, “simplicity and intuitiveness are key. When the mouse was introduced, the learning curve wasn’t steep.”

And that gives companies like Cybernet some hope. Because there’s nothing more intuitive than a wave of the hand.
