Technology Transfer Project

Amazon Research Award: Geometry-aware 3D Human Body Animation from Still Photos

Type

Technology Transfer Contract

Start Date

14/03/2019

End Date

14/03/2020

Project Description

This project is supported by a 2018 Amazon Research Award.

Being able to automatically generate 3D animations of the human body from a single image would open the door to many exciting new applications in areas such as the movie industry, photography, fashion, and e-commerce. Recent advances in Generative Adversarial Networks (GANs) have shown impressive results on the related task of facial expression synthesis. In [Pumarola ECCV 2018] we introduced one such approach, which anatomically encodes facial expressions in a continuous manifold. Automatic facial animation with GANs, however, has so far been addressed from a purely 2D perspective and is therefore limited to single viewpoints, typically frontal or quasi-fronto-parallel faces.

In this project we will extend this problem to the full human body and to varying viewpoints: given a single photo of a person, we will investigate approaches to forecast their motion and synthesize the corresponding images, even when these involve different body orientations and changing postures.

Compared to face animation, bringing still images of the full body to life involves dealing with a much larger variability of body configurations and of appearances due to clothing. Concretely, addressing this complex endeavor requires solving a number of sub-tasks, including foreground-background segmentation, single-image 3D human pose and shape estimation, action recognition, motion prediction, and photorealistic image synthesis; how these stages chain together is sketched below. Each of these problems is, by itself, tremendously challenging.
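
As a purely illustrative sketch of how these sub-tasks could be chained, the following Python skeleton uses hypothetical placeholder modules (segment_person, estimate_pose_shape, and so on, each returning dummy data so the skeleton runs); none of these names, signatures, or dimensions come from the project or from any released codebase.

    import numpy as np

    def segment_person(image):
        return np.ones(image.shape[:2], dtype=bool)   # placeholder: everything is foreground

    def estimate_pose_shape(image, mask):
        return np.zeros(72), np.zeros(10)             # SMPL-style 72-D pose, 10-D shape (assumption)

    def recognize_action(image, pose):
        return "walking"                              # placeholder action label

    def predict_motion(pose, action, horizon):
        return [pose.copy() for _ in range(horizon)]  # placeholder: hold the current pose

    def synthesize_frame(image, mask, shape, pose):
        return image                                  # placeholder: echo the input photo

    def animate_from_photo(image, n_frames=16):
        """Chain segmentation -> 3D pose/shape -> action recognition ->
        motion prediction -> photorealistic synthesis."""
        mask = segment_person(image)
        pose, shape = estimate_pose_shape(image, mask)
        action = recognize_action(image, pose)
        future_poses = predict_motion(pose, action, horizon=n_frames)
        return [synthesize_frame(image, mask, shape, p) for p in future_poses]

    frames = animate_from_photo(np.zeros((256, 256, 3), dtype=np.uint8))
    assert len(frames) == 16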

Yet we aim to develop GAN architectures able to tackle all of them jointly by integrating different sources of information: novel geometry-aware differentiable modules that estimate and predict human pose and shape, loss functions enforcing the photorealism of the synthesized images, and attention mechanisms that focus on the region of the image where the person is located.
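
To make the last two ingredients concrete, here is a minimal sketch assuming a PyTorch implementation with toy one-layer stand-ins for the generator heads and the discriminator; it shows only the generic pattern (attention-based composition plus an adversarial photorealism term), not the project's actual architecture or losses.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    photo = torch.rand(2, 3, 64, 64)                        # input still photos
    person_mask = (torch.rand(2, 1, 64, 64) > 0.5).float()  # person-region masks

    # Toy stand-ins for the generator heads and a patch-wise discriminator.
    G_attention = nn.Sequential(nn.Conv2d(3, 1, 3, padding=1), nn.Sigmoid())
    G_color = nn.Conv2d(3, 3, 3, padding=1)
    D = nn.Conv2d(3, 1, 3, padding=1)

    attention = G_attention(photo)  # soft mask in [0, 1]
    color = G_color(photo)          # re-synthesized appearance

    # Attention-based composition: only attended pixels are re-synthesized;
    # everything else is copied verbatim from the input photo.
    fake = attention * color + (1.0 - attention) * photo

    loss_adv = -D(fake).mean()                                   # adversarial photorealism term
    loss_focus = F.binary_cross_entropy(attention, person_mask)  # steer attention to the person
    loss = loss_adv + 0.1 * loss_focus
    loss.backward()                 # gradients for one generator update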

An additional difficulty is that no dataset of human action video sequences annotated with accurate volumetric body shape parameters and background segmentation masks is available. Since this type of data is very difficult to acquire, we intend to develop approaches that require as little supervision as possible.
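
As one example of what little supervision can mean in practice (our illustration, not a confirmed project design choice), a cycle-consistency objective in the spirit of [Pumarola ECCV 2018] trains a pose-conditioned generator without ground-truth target frames: the synthesized frame is mapped back to the source pose and must reconstruct the input photo. A minimal PyTorch sketch with a toy one-layer generator:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PoseConditionedG(nn.Module):
        """Toy generator conditioned on a pose vector broadcast over the image."""
        def __init__(self, pose_dim=72):
            super().__init__()
            self.net = nn.Conv2d(3 + pose_dim, 3, 3, padding=1)

        def forward(self, image, pose):
            B, _, H, W = image.shape
            pose_map = pose.view(B, -1, 1, 1).expand(B, pose.shape[1], H, W)
            return self.net(torch.cat([image, pose_map], dim=1))

    G = PoseConditionedG()
    photo = torch.rand(2, 3, 64, 64)
    pose_src, pose_tgt = torch.rand(2, 72), torch.rand(2, 72)

    fake = G(photo, pose_tgt)             # animate towards the target pose
    recon = G(fake, pose_src)             # map back to the original pose
    loss_cycle = F.l1_loss(recon, photo)  # must recover the input photo
    loss_cycle.backward()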