Production-level facial performance capture using deep convolutional neural networks

We present a real-time deep learning framework for video-based facial performance capture: the dense 3D tracking of an actor's face from a monocular video. Our pipeline begins by accurately capturing a subject with a high-end production facial capture pipeline based on multi-view stereo tracking and artist-enhanced animations. With 5--10 minutes of captured footage, we train a convolutional neural network to produce high-quality output, including self-occluded regions, from a monocular video sequence of that subject. Because the 3D facial performance capture is fully automated, our system can drastically reduce the labor involved in developing modern narrative-driven video games or films that feature realistic digital doubles of actors and potentially hours of animated dialogue per character. We compare our results against several state-of-the-art monocular real-time facial capture techniques and demonstrate compelling animation inference in challenging areas such as the eyes and lips.
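The core mapping the abstract describes, one monocular video frame in, a dense 3D face mesh out, can be sketched as a small convolutional regressor. The sketch below is illustrative only: the frame resolution, vertex count, layer shapes, and weight scales are made-up toy values, not the paper's actual network, and the random target mesh stands in for the multi-view-stereo ground truth used in training.

```python
import numpy as np

# Toy dimensions (assumptions for illustration, not the paper's values)
FRAME_H, FRAME_W = 60, 80   # downsampled grayscale input frame
N_VERTICES = 500            # dense face-mesh vertex count

def conv2d(x, w, stride=2):
    """Naive valid-mode strided 2D convolution, single channel in/out."""
    kh, kw = w.shape
    out_h = (x.shape[0] - kh) // stride + 1
    out_w = (x.shape[1] - kw) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(x[i * stride:i * stride + kh,
                                 j * stride:j * stride + kw] * w)
    return out

def relu(x):
    return np.maximum(x, 0.0)

rng = np.random.default_rng(0)
w1 = rng.standard_normal((5, 5)) * 0.01  # toy conv weights
w2 = rng.standard_normal((5, 5)) * 0.01

def features(frame):
    """Two strided conv+ReLU layers, flattened to a feature vector."""
    return relu(conv2d(relu(conv2d(frame, w1)), w2)).reshape(-1)

frame = rng.random((FRAME_H, FRAME_W))        # stand-in grayscale frame
feat_len = features(frame).size
w_fc = rng.standard_normal((feat_len, N_VERTICES * 3)) * 0.001  # linear regressor

def infer_vertices(frame):
    """Map one frame to an (N_VERTICES, 3) array of 3D vertex positions."""
    return (features(frame) @ w_fc).reshape(N_VERTICES, 3)

verts = infer_vertices(frame)
target = rng.random((N_VERTICES, 3))          # stand-in for a tracked ground-truth mesh
loss = np.mean(np.sum((verts - target) ** 2, axis=1))  # mean per-vertex squared error
```

In training, the loss above would be minimized over the 5--10 minutes of captured footage so that the network learns the subject-specific mapping from pixels to mesh; at runtime only the forward pass (`infer_vertices`) is needed, which is what makes real-time inference feasible.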
