Speech and Vision

For thousands of years, people have interacted through speech and gesture. Truly easy-to-use machines will do the same.

Michael Dertouzos

May 1, 2000

After 40 years of humans serving computers, people are finally beginning to wake up and demand that the relationship be reversed: They want machines to become simpler, address human needs and help increase human productivity. That’s as it should be. With computer technology maturing, it’s high time, as we have repeatedly said in this column, for makers and users of computers to change their focus from machines to people.

But how do you make a machine simpler? With fewer controls? Not really. As you know from watches and other devices that have one mode-changing button and another to select functions within a mode, you can easily get lost in a confusion of modes and functions. Would machines be simpler if they had fewer capabilities? No! Imagine a car that can only do two things: go and turn right. This car can go anywhere, but you wouldn’t prefer it to your own car, which has more capabilities. What is it that we really want when we ask that our computers be easier to use?

True ease of use, as opposed to the perfunctory use of colors and floating animals to create the illusion of “user friendliness,” involves means of interaction that are natural to people and therefore require no new learning. Speech and vision are the two principal means we have used to interact with other people and the world around us for thousands of years. That should be enough of a clue to steer our attention to these two modes of interaction. And since vision occupies two-thirds of the human cerebral cortex, we may be tempted to declare it the queen of human-machine communication. That would be an easy, but deceptive, conclusion.

Vision and speech do not serve the same natural roles in communication. Being Greek, I can still hold a “conversation” in Athens, through a car window, using only gestures and grimaces: one clockwise rotation of the wrist means “how are you,” while an oscillating motion of the right hand around the index finger, with palm extended and sides of mouth drawn downward, means “so-so.” A sign language like ASL works even better. But when speech is available, it invariably takes over as the preferred mode of human communication.

That’s because among people, speech is used symmetrically for transmitting and receiving concepts. Visual human communication, on the other hand, is highly asymmetric: we can perceive a huge amount of information with our eyes, but can’t deliver equally rich visual information with gestures (visual communication would be more symmetric if we all sported display monitors on our chests). The power of vision in humans lies primarily in the one-way digestion of an incredible amount of information. This is nature’s, or God’s, way of ensuring survival in a world of friends and enemies, edible and man-eating animals, useful and useless objects, lush valleys and dangerous ravines, where maximum information was essential.

But then, why didn’t nature, or God, make speech just as asymmetric as vision? I’ll venture the guess that speaking and listening were meant for intercommunication rather than perception, where, unlike survival, symmetry was desirable. And since survival was more important than chatting, the lion’s share of the human brain was dedicated to seeing.

These conclusions run against the common wisdom that for human-machine communication, “vision is just like speech, only more powerful.” Not so! These two serve different roles, which we should imitate in human-machine communication: Spoken dialogue should be the primary approach for back-and-forth exchanges, and vision should be the primary approach for human perception of information from the machine.

We can imagine situations where a visual human-machine dialogue would be preferable, for example, in learning from a machine how to ski or juggle. But we are interested in human-machine intercommunication across the full gamut of human interests, where, as telephony has demonstrated, speech-only exchanges go a long way. (Might these basic differences between speech and vision have contributed to the lack of success of video telephony?)

Finally, if we can combine speech and vision in communicating with our machines, as we do in our interactions with other people, we’ll be even better off. But that’s not easy to do yet, because the technologies for speech and vision are in different stages of development. Nor is the wish to combine them reason enough to ignore their different roles.

Conclusion: When you face a machine, instead of your surrounding world and other people, your interactions will be comparably natural, and the machine easiest to use, if it uses speech understanding and speech synthesis for two-way human-machine dialogue (these technologies have begun appearing commercially), and if it has large realistic displays that convey to you a great deal of visual information (as do today’s displays). As machine vision improves, it should be combined with speech for even more natural human-machine exchanges (such combined capabilities are now being researched and demonstrated in several research labs).

So, take heart: Simpler, more natural computer systems will enter our lives within the next 5 to 10 years. Let’s speed up their arrival, as users by asking for them, and as technologists by daring to build them.
