Self-supervised CNN for Unconstrained 3D Facial Performance Capture from an RGB-D Camera

We present a novel method for real-time 3D facial performance capture with consumer-level RGB-D sensors. Our capture system targets robust and stable 3D face capture in the wild, where the RGB-D facial data contain noise, imperfections and occlusion, and often exhibit high variability in motion, pose, expression and lighting conditions, all of which pose great challenges. The technical contribution is a self-supervised deep learning framework that is trained directly on raw RGB-D data. The key novelties include: (1) learning both the core tensor and the parameters for refining our parametric face model; (2) using vertex displacement and a UV map to learn surface detail; (3) designing a loss function that incorporates temporal coherence and same-identity constraints based on pairs of RGB-D images and uses sparse norms, in addition to the conventional terms for photo-consistency, feature similarity, regularization and geometry consistency; and (4) augmenting the training data set in new ways. The method is demonstrated in a live setup that runs in real time on a smartphone and an RGB-D sensor. Extensive experiments show that our method is robust to severe occlusion, fast motion, large rotation, exaggerated facial expressions and diverse lighting.
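
As a rough illustration of how a loss over a pair of RGB-D frames might combine a sparse photo-consistency term with same-identity and temporal-coherence penalties, the following is a minimal sketch in Python with NumPy. It is not the paper's formulation: the function names (`l21_norm`, `pair_loss`), the choice of an l2,1 norm as the sparse norm, the squared-difference penalties, the flat parameter vectors, and all weights are assumptions made for illustration, and the feature-similarity and geometry-consistency terms are omitted.

```python
import numpy as np


def l21_norm(residual):
    """Sparse l2,1 norm: sum of the Euclidean norms of the rows.

    Each row is assumed to be the residual of one pixel or vertex.
    """
    return float(np.sum(np.linalg.norm(residual, axis=1)))


def pair_loss(render_a, image_a, render_b, image_b,
              identity_a, identity_b, expr_a, expr_b,
              w_photo=1.0, w_id=1.0, w_temp=0.1, w_reg=0.01):
    """Hypothetical loss for two frames (a, b) of the same person.

    All weights are placeholder values, not taken from the paper.
    """
    # Photo-consistency: rendered colors should match observed colors.
    photo = l21_norm(render_a - image_a) + l21_norm(render_b - image_b)
    # Same-identity constraint: both frames should share identity parameters.
    same_id = float(np.sum((identity_a - identity_b) ** 2))
    # Temporal coherence: expression parameters should change smoothly.
    temporal = float(np.sum((expr_a - expr_b) ** 2))
    # Regularization: keep all parameters close to the model mean (zero).
    reg = float(np.sum(identity_a ** 2) + np.sum(identity_b ** 2)
                + np.sum(expr_a ** 2) + np.sum(expr_b ** 2))
    return w_photo * photo + w_id * same_id + w_temp * temporal + w_reg * reg
```

In this sketch the sparse norm is applied only to the photo-consistency residual, so a few badly mismatched pixels (for example under occlusion) contribute linearly rather than quadratically; whether the paper applies its sparse norms in the same place is not stated in the abstract.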
