What it does: First, the system is trained on footage of activities labelled with descriptions like "playing golf on grass." It can then recreate similar scenes given a snippet of text. Plus, it can make clips combining disparate concepts from training data, such as "sailing on snow."
Why it matters: Automatic generation of video from text could be incredibly useful: for creating huge sets of synthetic training data for autonomous cars, say. It could also lead to some worrying fake content.
But: The clips are just 32 frames long and 64x64 pixels in size. They're still not wholly convincing, and if they're made larger, accuracy plummets. All that needs fixing to build a useful text-to-video converter.