Real-Time Facial Segmentation and Performance Capture from RGB Input

We introduce the concept of unconstrained real-time 3D facial performance capture through explicit semantic segmentation of the RGB input. To ensure robustness, cutting-edge supervised learning approaches rely on large training datasets of face images captured in the wild. While impressive tracking quality has been demonstrated for faces that are largely visible, any occlusion due to hair, accessories, or hand-to-face gestures results in significant visual artifacts and loss of tracking accuracy. The modeling of occlusions has mostly been avoided because their appearance varies over an immense space. To address this curse of dimensionality, we perform tracking in unconstrained images under the assumption that non-face regions can be fully masked out. Building on recent breakthroughs in deep learning, we demonstrate that pixel-level facial segmentation is possible in real time by repurposing convolutional neural networks originally designed for general semantic segmentation. We develop an efficient architecture based on a two-stream deconvolution network with complementary characteristics, and introduce carefully designed training samples and data augmentation strategies for improved segmentation accuracy and robustness. We adopt a state-of-the-art regression-based facial tracking framework, trained on segmented face images, and demonstrate accurate and uninterrupted facial performance capture in the presence of extreme occlusion and even side views. Furthermore, the resulting segmentation can be used directly to composite partial 3D face models onto the input images and enables seamless facial manipulation tasks, such as virtual make-up or face replacement.
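The two uses of the segmentation described above, masking out non-face regions before tracking and compositing a partial face model onto the input, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function names `mask_non_face` and `composite`, the 0.5 threshold, and the toy 2x2 arrays are all assumptions made for illustration, and the per-pixel face probability map stands in for the output of a segmentation network.

```python
import numpy as np

def mask_non_face(image, face_prob, threshold=0.5):
    """Zero out pixels whose face probability falls below the threshold.

    image:     (H, W, 3) uint8 RGB frame
    face_prob: (H, W) float map in [0, 1] (hypothetical network output)
    The threshold of 0.5 is an assumed value, not taken from the paper.
    """
    mask = face_prob >= threshold            # (H, W) boolean face mask
    return image * mask[..., None], mask     # broadcast mask over RGB

def composite(image, rendered, mask):
    """Show rendered face pixels only where the mask marks visible face."""
    return np.where(mask[..., None], rendered, image)

# Toy 2x2 example: input frame, a rendered face layer, and a probability map.
image = np.full((2, 2, 3), 100, dtype=np.uint8)
rendered = np.full((2, 2, 3), 200, dtype=np.uint8)
face_prob = np.array([[0.9, 0.1],
                      [0.6, 0.4]])

masked, mask = mask_non_face(image, face_prob)
out = composite(image, rendered, mask)
```

In this sketch, occluded pixels (low face probability) are zeroed in `masked` and keep their original color in `out`, so an occluder such as a hand would stay in front of the composited face.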
