Facial Performance Capture with Deep Neural Networks

We present a deep learning technique for facial performance capture, i.e., the conversion of video footage into a motion sequence of a 3D mesh representing an actor's face. Specifically, we build on a conventional capture pipeline based on computer vision and multiview video, and use its results to train a deep neural network to produce similar output from a monocular video sequence. Once trained, our network produces high-quality results for unseen inputs with greatly reduced effort compared to the conventional system. In practice, we have found that approximately 10 minutes' worth of high-quality data is sufficient to train a network that can then automatically process as much footage from video to 3D as needed. This yields major savings in the development of modern narrative-driven video games involving digital doubles of actors and potentially hours of animated dialogue per character.
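The core idea above can be sketched as supervised regression: frames from a monocular video are mapped directly to per-frame 3D vertex positions, with training targets supplied by the conventional multiview capture pipeline. The following is a minimal illustrative sketch in numpy, not the paper's actual architecture; the input resolution, vertex count, hidden size, and the single fully connected ReLU layer are all hypothetical stand-ins for a real convolutional network.

```python
import numpy as np

# Hypothetical dimensions -- the paper does not specify these here.
H, W = 64, 64          # monocular input crop (grayscale), flattened below
N_VERTS = 5000         # mesh vertices; the network outputs 3 coords per vertex
HIDDEN = 256           # single hidden layer (stand-in for a deeper conv net)

rng = np.random.default_rng(0)
W1 = rng.normal(0.0, 0.01, (H * W, HIDDEN))
b1 = np.zeros(HIDDEN)
W2 = rng.normal(0.0, 0.01, (HIDDEN, 3 * N_VERTS))
b2 = np.zeros(3 * N_VERTS)

def forward(x):
    """Map a flattened frame to a flattened (N_VERTS, 3) vertex array."""
    h = np.maximum(0.0, x @ W1 + b1)   # ReLU hidden activation
    return h @ W2 + b2, h

def train_step(x, target, lr=1e-3):
    """One SGD step on the squared vertex-position error for a single frame.

    `target` is the vertex array produced by the conventional capture
    pipeline for this frame -- that is the supervision signal.
    """
    global W1, b1, W2, b2
    pred, h = forward(x)
    err = pred - target                 # dL/dpred (up to a constant factor)
    gW2 = np.outer(h, err); gb2 = err
    dh = (err @ W2.T) * (h > 0)         # backprop through the ReLU
    gW1 = np.outer(x, dh); gb1 = dh
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1
    return float(np.mean(err ** 2))

# Toy usage: one synthetic frame/target pair; loss decreases across steps.
frame = rng.random(H * W)
target = rng.normal(0.0, 0.1, 3 * N_VERTS)
loss0 = train_step(frame, target)
loss1 = train_step(frame, target)
```

At inference time only `forward` is needed, so arbitrary amounts of new monocular footage can be processed without re-running the multiview capture pipeline, which is where the stated savings come from.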
