Personalized Face Modeling for Improved Face
Reconstruction and Motion Retargeting
ECCV 2020 (Spotlight)
1 University of Washington 2 Microsoft Cloud and AI
Unfortunately, we are unable to release our code, since parts of it are Microsoft confidential property.
Our end-to-end framework. Our framework takes frames from in-the-wild video(s) of a
user as input and generates per-frame tracking parameters via the TrackNet and a personalized face
model of the user via the ModelNet. The model and tracking parameters are then combined to
obtain the 3D reconstruction. The networks are trained together in an end-to-end manner (marked in
red) by projecting the reconstructed outputs into 2D with a differentiable renderer and computing
multi-image consistency losses and other regularization losses.
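To make the two-network split concrete, below is a minimal PyTorch sketch of the pipeline in the figure. The TrackNet and ModelNet names come from the figure; all layer sizes, the small CNN backbone, the blendshape/vertex counts, and the omission of the differentiable renderer and losses are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

N_EXPR, N_VERTS = 51, 5023  # hypothetical blendshape and vertex counts

class TrackNet(nn.Module):
    """Predicts per-frame motion parameters: pose and expression weights."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(8), nn.Flatten(),
            nn.Linear(32 * 8 * 8, feat_dim), nn.ReLU())
        self.pose = nn.Linear(feat_dim, 6)      # axis-angle rotation + translation
        self.expr = nn.Linear(feat_dim, N_EXPR) # blendshape weights
    def forward(self, frames):
        f = self.backbone(frames)
        return self.pose(f), torch.sigmoid(self.expr(f))

class ModelNet(nn.Module):
    """Holds user-specific corrections on top of a 3DMM prior; shared
    across all frames of the same user."""
    def __init__(self):
        super().__init__()
        # zero-initialized deltas so optimization starts exactly at the prior
        self.bs_delta = nn.Parameter(torch.zeros(N_EXPR, N_VERTS, 3))
        self.shape_delta = nn.Parameter(torch.zeros(N_VERTS, 3))
    def forward(self, prior_shape, prior_blendshapes):
        return prior_shape + self.shape_delta, prior_blendshapes + self.bs_delta

def reconstruct(neutral, blendshapes, expr_w):
    """Blendshape interpolation: neutral + sum_k w_k * B_k."""
    offsets = torch.einsum('bk,kvc->bvc', expr_w, blendshapes)
    return neutral.unsqueeze(0) + offsets

# one forward pass (renderer, projection, and losses omitted for brevity)
frames = torch.rand(4, 3, 224, 224)
prior_shape = torch.zeros(N_VERTS, 3)
prior_bs = torch.zeros(N_EXPR, N_VERTS, 3)
tracknet, modelnet = TrackNet(), ModelNet()
pose, expr_w = tracknet(frames)
neutral, blendshapes = modelnet(prior_shape, prior_bs)
verts = reconstruct(neutral, blendshapes, expr_w)  # (4, N_VERTS, 3)
```

In the real system the reconstructed vertices would be rendered back into 2D with a differentiable renderer and supervised with the multi-image consistency and regularization losses named in the caption.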
Traditional methods for image-based 3D face reconstruction and
facial motion retargeting fit a 3D morphable model (3DMM) to the face, which
has limited modeling capacity and fails to generalize well to in-the-wild data.
Using deformation transfer or a multilinear tensor as a personalized 3DMM
for blendshape interpolation does not address the fact that facial expressions
produce different local and global skin deformations in different people.
Moreover, existing methods learn a single albedo map per user, which is not enough
to capture expression-specific skin reflectance variations. We propose an
end-to-end framework that jointly learns a personalized face model per user and
per-frame facial motion parameters from a large corpus of in-the-wild videos
of user expressions. Specifically, we learn user-specific expression blendshapes
and dynamic (expression-specific) albedo maps by predicting personalized
corrections on top of a 3DMM prior. We introduce novel training constraints to ensure that
the corrected blendshapes retain their semantic meanings and the reconstructed
geometry is disentangled from the albedo. Experimental results show that our
personalization accurately captures fine-grained facial dynamics in a wide range
of conditions and efficiently decouples the learned face model from facial motion,
resulting in more accurate face reconstruction and facial motion retargeting
compared to state-of-the-art methods.
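The dynamic-albedo idea from the abstract can be sketched as a static per-user albedo map plus an expression-driven correction. The snippet below is a minimal PyTorch illustration under assumed parameterizations: the texture resolution, the small decoder, and the simple magnitude regularizers (standing in for the paper's semantic-preservation and geometry/albedo disentanglement constraints, which are not specified here) are all hypothetical.

```python
import torch
import torch.nn as nn

class DynamicAlbedo(nn.Module):
    """Static per-user albedo plus an expression-specific correction."""
    def __init__(self, tex_res=64, n_expr=51):
        super().__init__()
        self.tex_res = tex_res
        self.static_albedo = nn.Parameter(torch.full((3, tex_res, tex_res), 0.5))
        # hypothetical decoder: expression weights -> per-texel correction
        self.decoder = nn.Sequential(
            nn.Linear(n_expr, 128), nn.ReLU(),
            nn.Linear(128, 3 * tex_res * tex_res), nn.Tanh())
    def forward(self, expr_w):
        delta = self.decoder(expr_w).view(-1, 3, self.tex_res, self.tex_res)
        albedo = (self.static_albedo.unsqueeze(0) + 0.1 * delta).clamp(0, 1)
        return albedo, delta

def correction_regularizers(bs_delta, albedo_delta):
    # keep both corrections small so the blendshapes stay close to the prior
    # (retaining their semantics) and the albedo does not absorb shading;
    # a stand-in for the paper's actual training constraints
    return bs_delta.pow(2).mean() + albedo_delta.abs().mean()

albedo_net = DynamicAlbedo()
expr_w = torch.rand(4, 51)
albedo, delta = albedo_net(expr_w)  # (4, 3, 64, 64)
reg = correction_regularizers(torch.zeros(51, 5023, 3), delta)
```

Tying the albedo correction to the expression weights is what lets reflectance vary with expression (e.g. wrinkle shading) while the static map keeps the user's overall skin tone fixed.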
Results for Images
Qualitative results of our method. Our modeling network accurately captures
high-fidelity facial details specific to the user, thereby enabling the tracking network to better learn
user-independent facial motion. Our network can handle a wide variety of expressions, head poses, lighting conditions, ages, ethnicities, facial hair, and makeup.
Results for Videos