Microsoft’s Plan to Bring About the Era of Gesture Control

Apple might have made the touch screen ubiquitous, but Microsoft thinks hands-free interfaces will be just as big.

Tom Simonite

October 22, 2012

While most of the headlines about Microsoft this fall will concern its new operating system, Windows 8, and its new Surface tablet, the company is also working hard on a long-term effort to reinvent the way we interact with existing computers. The company wants to make it as common to wave your arms at or speak to a computer as it is to reach for a mouse or touch screen today.

That’s the goal of a program called Kinect for Windows, which aims to put the wildly successful gaming accessory Kinect wherever Microsoft’s ubiquitous operating system is found. It’s also designed to allow computers to be used in new ways—for example, by surgeons who don’t want to touch a keyboard with sterilized hands midway through surgery.

“We’re trying to encourage [software] developers to create a whole new class of app controlled by gesture and voice,” says Peter Zatloukal, head of engineering for the Kinect for Windows program.

It wasn’t long after the Kinect first launched that software engineers began repurposing it for different uses, such as robot navigation or controlling a Web browser (see “Hackers Take the Kinect to New Levels”). Kinect for Windows is intended to harness that enthusiasm and make the technology widespread.

Zatloukal says the result will be on a par with other big shifts in how we control computers. “We initially used keyboards, then the mouse and GUIs were a big innovation, now touch is a big part of people’s lives,” he says. “The progression will now be to voice and gesture.” Health care, manufacturing, and education are all areas where Zatloukal expects to see Kinect for Windows succeed. A conventional keyboard, mouse, or touch screen can be difficult to use in classrooms and hospital wards, or on factory floors.

A Kinect unit uses infrared and conventional cameras to track gestures and a microphone to take voice input (see “Hack: Microsoft Kinect”). Kinect for Windows equipment went on sale in February for $249 and is now available in 32 countries. Earlier this month, Microsoft made the hardware available for the first time in China, a country with both many potential customers and a sizable number of computer manufacturers.

For the platform to reach millions of people, Microsoft needs software developers to create killer applications. Along with the hardware, the company provides a software developer’s kit, or SDK, that offers a range of ready-made tools, including voice recognition and body-part tracking. Earlier this month the SDK was expanded so that applications can sense objects and people more than four meters away and let users control software with a virtual cursor.

“We opened up a lot more of the data from the sensor,” says Zatloukal. “For example, by using infrared, your apps can see in the dark now.” That could allow a Kinect-based system to understand gestures while a person is watching TV in a darkened room.

Zatloukal won’t talk about how exactly Microsoft will take Kinect for Windows to market. The hardware is currently pitched only at software developers, but if many programs become available Microsoft could offer it directly to consumers, or encourage computer manufacturers to bundle it with desktops, laptops, or monitors in place of a regular webcam.

Some large companies, including Bloomingdale’s and Nissan, are collaborating with Microsoft to test out possible applications. Nissan has introduced a gesture-controlled system for dealerships that lets prospective buyers look inside a virtual version of a new car. Meanwhile, startups are appearing that sell systems based on Kinect for Windows (see “Meet the Nimble-Fingered Interface of the Future”).

One such startup is GestSure, based in Seattle, which sells a product based on Kinect for Windows that lets surgeons use gestures to control a computer showing medical images. Although referring to scans and other images is crucial in the operating room, surgeons can’t use a computer without touching an unsterilized controller and having to scrub up again, or asking a nurse to operate the computer for them. Both these options waste valuable time, says Jamie Tremaine, cofounder and CEO of the startup.

GestSure sells a box that, once connected to an operating room computer, waits for a surgeon to make a specific hand signal and then tracks the doctor’s arms to allow control of the cursor. “When there’s no barrier to referring to images, surgeons check [them] a lot more,” says Tremaine.

Tremaine says the Kinect hardware’s ability to perceive depth makes it easier to recognize and track objects, something very challenging using an ordinary camera. It even trumps voice recognition, he says. “Voice recognition is 95 to 98 percent accurate, so one time in 50 it won’t work,” he says. “This works like a tool—it will work for you every time.”

Other startups building products around Kinect for Windows include Jintronix, which is using it to help people with physical rehabilitation after a stroke. Users can work through exercises at home while their computer watches closely and gives feedback. Another company, Freak’n Genius, offers gesture-based animation software.

Mark Bolas, an associate professor and director of the Mixed Reality Lab at the University of Southern California, says the technology in Kinect for Windows can help make it more natural to use computers. “When using a computer today, we think of our bodies as a fingertip or at most two fingertips,” he says. But humans evolved to communicate with their whole bodies.

Bolas’s research group is experimenting with using Kinect to track very subtle behaviors—monitoring the rise and fall of a person’s chest to measure breathing rate, for example. “It’s a great cue for stress and useful for teleconferencing,” he says.

Displaying an indication of someone’s breathing rate during a video call allows others to understand a person better, he says, and can show when to start talking without interrupting. The group has also experimented with detecting fidgeting or defensive body language such as folded arms. The hope is to address the social cues that are lost when video calls replace face-to-face communication, says Bolas. “Meat matters,” he says, “and the Kinect brings that to computing.”
