MIT Technology Review



But then, why didn’t nature, or God, make speech just as asymmetric as vision? I’ll venture the guess that speaking and listening were meant for intercommunication rather than perception, and that in intercommunication, unlike in survival, symmetry was desirable. And since survival was more important than chatting, the lion’s share of the human brain was dedicated to seeing.

These conclusions run against the common wisdom that for human-machine communication, “vision is just like speech, only more powerful.” Not so! These two serve different roles, which we should imitate in human-machine communication: Spoken dialogue should be the primary approach for back-and-forth exchanges, and vision should be the primary approach for human perception of information from the machine.

We can imagine situations where a visual human-machine dialogue would be preferable, for example, when a machine teaches us to ski or juggle. But we are interested in human-machine intercommunication across the full gamut of human interests, where, as telephony has demonstrated, speech-only exchanges go a long way. (Might these basic differences between speech and vision have contributed to the lack of success of video telephony?)

Finally, if we can combine speech and vision in communicating with our machines, as we do in our interactions with other people, we’ll be even better off. But that’s not easy to do yet, because the technologies for speech and vision are in different stages of development. Nor is the wish to combine them reason enough to ignore their different roles.

Conclusion: When you face a machine, instead of your surrounding world and other people, your interactions will be comparably natural, and the machine easiest to use, if it uses speech understanding and speech synthesis for two-way human-machine dialogue (these technologies have begun appearing commercially), and if it has large realistic displays that convey to you a great deal of visual information (as do today’s displays). As machine vision improves, it should be combined with speech for even more natural human-machine exchanges (such combined capabilities are now being researched and demonstrated in several research labs).

So, take heart: Simpler, more natural computer systems will enter our lives within the next 5 to 10 years. Let’s speed up their arrival, as users by asking for them, and as technologists by daring to build them.


Tagged: Communications
