Machines will need to get a lot better at making sense of the world on their own if they are ever going to become truly intelligent.
DeepMind, the AI-focused subsidiary of Alphabet, has taken a step in that direction by making a computer program that builds a mental picture of the world all by itself. You might say that it learns to imagine the world around it.
The system, which uses what DeepMind’s researchers call a generative query network (GQN), looks at a scene from several angles and can then describe what it would look like from another angle.
This might seem trivial, but it requires a relatively sophisticated ability to learn about the physical world. In contrast to many AI vision systems, the DeepMind program makes sense of a scene more the way a person does. Even if something is partly occluded, for example, it can reason about what’s there.
Eventually, such technology might serve as a foundation for deeper artificial intelligence, letting machines describe and reason about the world with much greater sophistication.
Ali Eslami, a research scientist at DeepMind, and his colleagues tested the approach on three virtual settings: a block-like tabletop, a virtual robot arm, and a simple maze. The system uses two neural networks: one learns a compact representation of a scene from sample views; the other generates, or “imagines,” how the scene would look from a new perspective. The system captures aspects of a scene, including object shapes, positions, and colors, in a vector representation, which makes it relatively efficient. The research appears in the journal Science today.
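The two-network split can be illustrated with a minimal sketch. This is not DeepMind’s actual model (the paper’s networks are deep convolutional and recurrent, and are trained end to end); the sizes, weights, and function names below are hypothetical, chosen only to show how per-view embeddings can be pooled into a single scene vector that a generator then queries with a new viewpoint:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes -- not the dimensions used in the actual GQN paper.
IMG_DIM = 64        # flattened observation (e.g., a tiny 8x8 grayscale image)
POSE_DIM = 7        # camera position + orientation
SCENE_DIM = 32      # size of the scene representation vector

# "Representation network": embeds one (image, viewpoint) observation.
# Untrained random weights, for illustration only.
W_repr = rng.normal(scale=0.1, size=(IMG_DIM + POSE_DIM, SCENE_DIM))

def represent(image, pose):
    """Embed a single observation into the scene-vector space."""
    x = np.concatenate([image, pose])
    return np.tanh(x @ W_repr)

def aggregate(observations):
    """Sum per-view embeddings into one scene representation.
    Summation makes the result order-invariant across views."""
    return sum(represent(img, pose) for img, pose in observations)

# "Generation network": renders a prediction for a query viewpoint.
W_gen = rng.normal(scale=0.1, size=(SCENE_DIM + POSE_DIM, IMG_DIM))

def generate(scene_vec, query_pose):
    """Predict what the scene looks like from an unseen viewpoint."""
    x = np.concatenate([scene_vec, query_pose])
    return np.tanh(x @ W_gen)

# Toy usage: embed three random views, then query a fourth viewpoint.
views = [(rng.normal(size=IMG_DIM), rng.normal(size=POSE_DIM))
         for _ in range(3)]
scene = aggregate(views)
prediction = generate(scene, rng.normal(size=POSE_DIM))
print(prediction.shape)  # (64,)
```

The key design point the sketch preserves is that the scene vector is a fixed-size summary, independent of how many views were observed, which is part of what makes the representation efficient.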
The work is something of a new direction for DeepMind, which has made its name by developing programs capable of performing remarkable feats, including learning how to play the complex and abstract board game Go. The new project builds upon other academic research that seeks to mimic human perception and intelligence using similar computational tools.
“It is an interesting and valuable step in the right direction,” says Josh Tenenbaum, a professor who leads the Computational Cognitive Science group at MIT.
Tenenbaum says the ability to deal with complex scenes in a modular way is impressive but adds that the approach shows the same limitations as other machine-learning methods, including a need for a huge amount of training data: “The jury is still out on how much of the problem this solves.”
Sam Gershman, who heads the Computational Cognitive Neuroscience Lab at Harvard, says the DeepMind work combines some important ideas about how human visual perception works. But he notes that, like other AI programs, it is somewhat narrow, in that it can answer only a single query: what would a scene look like from a different viewpoint?
“In contrast, humans can answer an infinite variety of queries about a scene,” Gershman says. “What would a scene look like if I moved the blue circle a bit to the left, or repainted the red triangle, or squashed the yellow cube?”
Gershman says it’s unclear whether DeepMind’s approach could be adapted to answer more complex questions or whether some fundamentally different approach might be required.