Technology Transfer Project

Google Research Award: GANimation3D: Unsupervised 3D Face Animation from Monocular Images



Project Description

This project is funded by a 2019 Google Research Award.

Being able to automatically generate 3D face animations from a single image or a monocular video would open the door to many exciting new applications in different areas, including the creation of avatars for Augmented Reality, as well as applications in the movie industry, photography technologies and e-commerce, to name a few. This problem can be addressed with a model-based approach that reasons about the geometry: first fitting an accurate 3D model onto the input image, and then mapping the image texture onto warped versions of the underlying mesh [Kim SIGGRAPH 2018, Thies CVPR 2016]. This approach, however, relies on establishing 2D-to-3D landmark correspondences between the input face and the 3D model, which can be difficult to obtain in many real situations, e.g. due to low image resolution or motion blur.

An attractive way to avoid relying on feature correspondences is to use large amounts of data to train a deep network that synthesizes new facial expressions directly from the input image. Along this line, recent advances in Generative Adversarial Networks (GANs) have shown impressive results for automatically changing the facial expression among a discrete number of categories (e.g. from a smiling to a sad face) [Choi CVPR 2018, Zhu ICCV 2017, Li arXiv 2016]. In [Pumarola ECCV 2018] we moved forward and proposed an anatomically-aware approach that allows the facial expression to be changed smoothly along a continuous manifold. Automatic facial animation with GANs, however, has so far been addressed in a purely 2D domain, and is thus limited to single viewpoints, typically frontal or quasi-frontoparallel faces.
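A key ingredient of the anatomically-aware approach in [Pumarola ECCV 2018] is an attention mechanism: the generator outputs both a colour image C with the new expression and a per-pixel attention mask A, and the final frame keeps the original input pixel wherever A is high (background, hair) while using the synthesized colour where A is low (expression-relevant regions such as the mouth and eyes). A minimal sketch of that composition step, using plain Python lists in place of the actual network tensors (the function name and toy values below are illustrative, not taken from the project):

```python
def compose_expression(input_img, color_img, attention):
    """Blend the generator's colour output with the input image
    using a per-pixel attention mask:

        out = (1 - A) * C + A * I

    input_img, color_img and attention are same-sized nested lists
    of floats in [0, 1]; attention = 1 keeps the input pixel,
    attention = 0 takes the synthesized colour.
    """
    out = []
    for row_i, row_c, row_a in zip(input_img, color_img, attention):
        out.append([(1.0 - a) * c + a * i
                    for i, c, a in zip(row_i, row_c, row_a)])
    return out

# Toy 2x2 single-channel example.
I = [[0.8, 0.8], [0.2, 0.2]]   # input face
C = [[0.1, 0.1], [0.9, 0.9]]   # synthesized colour
A = [[1.0, 0.0], [0.5, 1.0]]   # attention mask (1 = keep input)
print(compose_expression(I, C, A))
```

Because the composition is a convex combination per pixel, the network only has to hallucinate the regions that actually change, which is what makes the interpolation along the expression manifold smooth.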

In this project, we will devise methodologies to integrate 3D geometry priors within the GAN learning process, in order to build a system which, given a single still image or a monocular video, is able to render new facial expressions while simultaneously allowing the camera viewpoint and/or 3D face orientation to be changed. Addressing this complex endeavor requires resolving a number of sub-tasks, including face localization and background subtraction, monocular 3D pose and shape estimation, and photorealistic image synthesis. Each of these problems is, by itself, tremendously challenging. Yet, we aim at developing GAN architectures able to tackle all of them by integrating different sources of information: novel geometry-aware differentiable modules to estimate and predict face pose and geometry, loss functions enforcing photorealism of the synthesized images, and attention mechanisms that focus on the region of the image where the face is located.
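In practice, heterogeneous sources of information like these are typically combined during training through a weighted sum of the individual loss terms, with the weights balancing realism against geometric and photometric fidelity. A minimal sketch of such an objective; the term names and weight values are purely illustrative placeholders, not values from the project:

```python
def total_loss(adv, geom, photo, attn,
               w_geom=10.0, w_photo=10.0, w_attn=0.1):
    """Weighted sum of the loss terms sketched in the text.

    adv   -- adversarial loss enforcing realism of the output
    geom  -- geometry term from the differentiable pose/shape module
    photo -- photorealism / reconstruction term
    attn  -- regularizer keeping the attention mask sparse and smooth

    The weights are hypothetical defaults for illustration only.
    """
    return adv + w_geom * geom + w_photo * photo + w_attn * attn

# Example: per-batch scalar losses produced elsewhere in training.
print(total_loss(adv=0.5, geom=0.02, photo=0.03, attn=0.4))
```

Tuning such weights is usually what decides whether the generator prioritizes sharp, realistic texture (adversarial term) or faithful geometry and appearance (the remaining terms).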

An additional difficulty is that no dataset of facial expressions captured under different camera views and annotated with 3D shape is available. Since this type of dataset is very difficult and costly to acquire, we intend to develop approaches that require as little supervision as possible.