DeepMind has developed software that forms links between activities and sounds in video through unsupervised learning. New Scientist reports that the firm's new AI uses three neural nets: one for image recognition, another for identifying sounds, and a third that ties results from the two together. But unlike many machine-learning algorithms, which are provided with labeled data sets to help them associate words with what they see or hear, this system was given a pile of raw data and left to fend for itself.
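The article doesn't spell out the network details, but the three-part setup it describes can be sketched roughly: one subnetwork embeds a video frame, a second embeds a short audio clip, and a third takes both embeddings and scores whether they belong together. The PyTorch sketch below is purely illustrative, and every class name, layer size, and shape is an assumption rather than DeepMind's actual architecture.

```python
import torch
import torch.nn as nn

class VisionNet(nn.Module):
    """Hypothetical convolutional net that embeds a single video frame."""
    def __init__(self, embed_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, embed_dim)

    def forward(self, frames):               # frames: (batch, 3, H, W)
        return self.fc(self.conv(frames).flatten(1))

class AudioNet(nn.Module):
    """Hypothetical convolutional net that embeds a one-second audio spectrogram."""
    def __init__(self, embed_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, embed_dim)

    def forward(self, spectrograms):          # spectrograms: (batch, 1, freq, time)
        return self.fc(self.conv(spectrograms).flatten(1))

class FusionNet(nn.Module):
    """Ties the two embeddings together: scores whether frame and audio co-occur."""
    def __init__(self, embed_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * embed_dim, 128), nn.ReLU(),
            nn.Linear(128, 1),                # single logit: "do these correspond?"
        )

    def forward(self, img_emb, aud_emb):
        return self.mlp(torch.cat([img_emb, aud_emb], dim=1)).squeeze(1)
```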

It was left alone with 60 million video stills, each paired with a one-second audio clip taken from the same point in the video from which the frame was captured. Without human assistance, the system slowly learned how sounds and image features were related, eventually becoming able to link, say, crowds with a cheer and typing hands with that familiar clickety-clack. It can’t yet put a word to any of its observations, but it is another step toward AIs that can make sense of the world without constantly being told what they are looking at.
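To see how learning can happen from raw frame/audio pairs alone, here is one hedged sketch of the kind of self-supervised objective such a system could use: a frame and the audio clip recorded at the same moment count as a positive pair, while a frame matched with audio from a different video counts as a negative, and the fusion net is trained to tell them apart. It reuses the hypothetical nets sketched above; this is an assumption about the training signal, not a description of DeepMind's published method.

```python
import torch
import torch.nn.functional as F

def correspondence_loss(vision_net, audio_net, fusion_net, frames, audio):
    """One self-supervised step on a batch of (frame, one-second audio) pairs."""
    img_emb = vision_net(frames)          # (batch, embed_dim)
    aud_emb = audio_net(audio)            # (batch, embed_dim)

    # Positives: each frame with the audio clip captured at the same moment.
    pos_logits = fusion_net(img_emb, aud_emb)

    # Negatives: each frame with another item's audio (rolling the batch by one
    # guarantees a mismatch, assuming batch items come from different videos).
    neg_logits = fusion_net(img_emb, aud_emb.roll(shifts=1, dims=0))

    logits = torch.cat([pos_logits, neg_logits])
    labels = torch.cat([torch.ones_like(pos_logits),
                        torch.zeros_like(neg_logits)])
    return F.binary_cross_entropy_with_logits(logits, labels)
```

Minimizing a loss like this pushes the vision and audio embeddings to capture whatever shared structure predicts co-occurrence, which is how associations such as "crowd goes with cheering" can emerge without anyone labeling the data.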