Mapping one person’s motion onto the movement of another has changed the way filmmakers, animators, and game designers create action. With this technique, one person can appear to dance, run, or shoot like somebody else.
The technique is based on the ability to monitor movement using a variety of sensors and cameras to build up a 3-D picture of someone’s pose. But the process is generally expensive and time-consuming. So a technique that performs the same trick using 2-D video taken from a single camera would be hugely useful.
Enter Caroline Chan and colleagues from the University of California, Berkeley, who have devised a clever “do as I do” motion transfer technique based on 2-D video from a single camera. The technique allows them to transfer the performance of a professional dancer to an amateur with relative ease.
Chan and co’s new method is straightforward. They start with two videos. One shows the movement of an individual to be transferred—the source. The other is the target, showing the individual whose movement is to be adapted.
“Given a video of a source person and another of a target person, our goal is to generate a new video of the target person enacting the same motions as the source,” say Chan and co.
Their approach is to do this frame by frame. They start with a frame from the source video, with the aim of mapping the pose in this frame to the target video. “Our goal is therefore to discover an image-to-image translation between the source and target [frame] sets,” they say.
One way to do this would be to video the target individual performing a wide range of movements to create a database of all possible poses. The task is then to choose the target pose that matches the source pose.
But this is impractical. “Even if both [target and source] subjects perform the same routine, it is still unlikely to have an exact frame to frame body-pose correspondence due to body shape and stylistic differences unique to each subject,” say Chan and co.
So the team take a different approach involving an intermediate step. They begin by videoing the target individual making a series of movements and then map these onto a simple stick figure, which encodes body position but not appearance.
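This intermediate encoding can be illustrated with a minimal sketch: detected keypoints are rasterized into a binary image that records body position but nothing about appearance. The five-joint skeleton, bone list, and canvas size below are invented for the example; real pose detectors track far more joints.

```python
import numpy as np

# Hypothetical 5-joint skeleton: (head, neck, hip, left hand, right hand).
# Each pair indexes into a (num_joints, 2) array of (row, col) keypoints.
BONES = [(0, 1), (1, 2), (1, 3), (1, 4)]

def draw_line(canvas, p0, p1):
    """Rasterize one bone onto the canvas by sampling points along the segment."""
    n = int(max(abs(p1[0] - p0[0]), abs(p1[1] - p0[1]))) + 1
    for t in np.linspace(0.0, 1.0, n):
        y = int(round(p0[0] + t * (p1[0] - p0[0])))
        x = int(round(p0[1] + t * (p1[1] - p0[1])))
        if 0 <= y < canvas.shape[0] and 0 <= x < canvas.shape[1]:
            canvas[y, x] = 1.0

def pose_to_stick_figure(keypoints, height=64, width=64):
    """Encode body position (not appearance) as a binary stick-figure image."""
    canvas = np.zeros((height, width), dtype=np.float32)
    for a, b in BONES:
        draw_line(canvas, keypoints[a], keypoints[b])
    return canvas
```

Because the encoding discards clothing, lighting, and identity, the same stick figure can later be produced from either the source or the target video.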
This creates a database in which each frame is associated with a stick-figure pose. Chan and co then use this database to train a type of machine-vision system called a generative adversarial network to do this task in reverse—to create an image of the target person, given a specific stick-figure pose.
This turns out to be the key to transferring the motion of the source to the target. It is a relatively simple process to convert the source’s body pose into a stick-figure pose and then feed this into the generative adversarial network to create an image of the target in the same pose.
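Put together, the transfer loop might look like the sketch below. The three components are hypothetical stand-ins: `detect_pose` for a pretrained pose detector, `render_stick_figure` for the pose encoding, and `generator` for the trained network; their bodies here are placeholders so the pipeline runs end to end.

```python
import numpy as np

def detect_pose(frame):
    """Stand-in pose detector: returns a (joints, 2) keypoint array.
    Here it just locates the centroid of bright pixels."""
    ys, xs = np.nonzero(frame > 0.5)
    return np.array([[ys.mean(), xs.mean()]]) if len(ys) else np.zeros((1, 2))

def render_stick_figure(keypoints, shape):
    """Stand-in renderer: marks keypoints on an appearance-free canvas."""
    canvas = np.zeros(shape, dtype=np.float32)
    for y, x in keypoints.astype(int):
        canvas[np.clip(y, 0, shape[0] - 1), np.clip(x, 0, shape[1] - 1)] = 1.0
    return canvas

def generator(pose_image):
    """Stand-in for the trained GAN that maps a pose image to the target."""
    return pose_image * 0.9

def transfer_motion(source_frames):
    """Frame-by-frame 'do as I do' transfer: the source's pose drives
    synthesized images of the target person."""
    output = []
    for frame in source_frames:
        keypoints = detect_pose(frame)                        # 1. source pose
        stick = render_stick_figure(keypoints, frame.shape)   # 2. pose encoding
        output.append(generator(stick))                       # 3. target in that pose
    return output
```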
Training the generative adversarial network requires about 20 minutes of 120-frame-per-second video of the target individual performing a wide range of movements. Many modern smartphones can shoot video at this frame rate. “Since our pose representation does not encode information about clothes, we had our target subjects wear tight clothing with minimal wrinkling,” say Chan and co.
By contrast, the researchers do not need the same amount of video of the source, since all they need to do is pose detection. “Many high quality videos of a subject performing a dance are abundant online,” they say.
To make the resulting motion-transfer video more realistic, Chan and co add two extra flourishes. First, they constrain each generated frame to differ only slightly from the previous one, which keeps the resulting video smooth.
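One way to picture this smoothness constraint is as conditioning each new frame on the previously generated one. The sketch below is a deliberate simplification of that idea: a fixed blending weight stands in for what the network actually learns, and the generator is a placeholder.

```python
import numpy as np

def generator(pose_image, prev_frame, blend=0.8):
    """Placeholder temporally conditioned generator: the real network learns
    this mapping; blending in the previous frame illustrates the constraint
    that consecutive outputs change only a little."""
    raw = pose_image  # stand-in for the learned pose -> image map
    return blend * prev_frame + (1.0 - blend) * raw

def smooth_sequence(pose_images):
    """Generate frames one by one, each conditioned on its predecessor."""
    frames, prev = [], np.zeros_like(pose_images[0])
    for pose in pose_images:
        prev = generator(pose, prev)
        frames.append(prev)
    return frames
```

Even when the input poses jump abruptly, consecutive output frames differ far less than the raw poses do, which is what makes the synthesized video look smooth.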
They also train another generative adversarial network to create realistic images of the target person’s face as the pose changes. This increases the video’s realism.
The results are impressive. You can watch a wide range of examples in the video shown here: the movement of professional dancers and even a ballerina transferred to ordinary individuals. “Our method is able to produce compelling videos given a variety of inputs,” say Chan and co.
There are a few challenges to work on. One problem is that the stick-figure pose generation does not account for differing limb lengths between the source and target; nor does it allow for the way different cameras and angles can distort or foreshorten certain poses. And occasionally the system cannot detect the correct pose because the subject is moving rapidly or is obscured.
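The limb-length mismatch suggests a normalization step between detection and rendering: mapping source keypoints into the target's coordinate frame so that body-size differences don't distort the stick figure. The helper below is an illustrative sketch of such a linear rescaling; the function name and the idea of normalizing vertical extent are assumptions for the example, not the paper's exact procedure.

```python
import numpy as np

def normalize_pose(src_kpts, src_min, src_max, tgt_min, tgt_max):
    """Linearly rescale source keypoint coordinates (e.g. heights) from the
    source's observed range into the target's range, so a tall source's
    pose lands correctly on a short target's stick figure."""
    scale = (tgt_max - tgt_min) / (src_max - src_min)
    return tgt_min + (src_kpts - src_min) * scale
```

For instance, keypoint heights spanning pixels 10 to 100 in the source video would be squeezed into a target whose body spans pixels 40 to 80.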
Given the impressive results, these are clearly not show-stopping problems, and they should be ironed out in the near future.
The only question now is how the technique will come to market. It will surely be of interest to a wide range of startups and more established players in social image sharing.
Ref: arxiv.org/abs/1808.07371 : Everybody Dance Now