High-fidelity facial and speech animation for VR HMDs

Significant challenges currently prohibit expressive interaction in virtual reality (VR). The occlusions introduced by head-mounted displays (HMDs) defeat existing facial tracking techniques, and even state-of-the-art methods for real-time facial tracking in unconstrained environments fail to capture the subtle details of the user's facial expressions that are essential for compelling speech animation. We introduce a novel system that allows HMD users to control a digital avatar in real time while producing plausible speech animation and emotional expressions. Using a monocular camera attached to an HMD, we record multiple subjects performing various facial expressions and speaking several phonetically balanced sentences. These images are paired with artist-generated animation data for the same sequences to train a convolutional neural network (CNN) that regresses images of a user's mouth region to the parameters controlling a digital avatar. To make training this system more tractable, we use audio-based alignment techniques to map images of multiple users making the same utterance to the corresponding animation parameters. We demonstrate that this approach is also feasible for tracking the expressions around the user's eye region with an internal infrared (IR) camera, thereby enabling full facial tracking. The system requires no user-specific calibration, uses easily obtainable consumer hardware, and produces high-quality animations of speech and emotional expressions. Finally, we demonstrate the quality of our system on a variety of subjects and evaluate its performance against state-of-the-art real-time facial tracking techniques.
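As a concrete illustration of the regression stage described above, the following is a minimal sketch, not the authors' implementation: a small PyTorch CNN that maps a grayscale mouth-region crop to the animation parameters driving an avatar. The architecture, the 96x96 input resolution, and the parameter count `n_params` are illustrative assumptions.

```python
# Minimal sketch (assumed architecture, not the paper's): a CNN regressing
# a grayscale mouth crop to n_params avatar animation parameters.
import torch
import torch.nn as nn

class MouthRegressor(nn.Module):
    def __init__(self, n_params: int = 40):  # n_params is an assumption
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 4 * 4, 256), nn.ReLU(),
            nn.Linear(256, n_params),  # avatar animation parameters
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.regressor(self.features(x))

model = MouthRegressor()
imgs = torch.randn(8, 1, 96, 96)   # batch of mouth-region crops
params = model(imgs)               # shape (8, n_params)
# Trained with a regression loss against the artist-generated parameters:
loss = nn.functional.mse_loss(params, torch.zeros_like(params))
```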
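The audio-based alignment can be realized with dynamic time warping (DTW) over per-frame audio features. The sketch below is a plausible reconstruction under that assumption, not the paper's exact procedure: it aligns two feature sequences (e.g., MFCCs) of the same sentence spoken by a reference performer and a new subject, and the resulting warping path lets the reference's artist-generated animation parameters label the new subject's video frames, yielding (image, parameters) training pairs.

```python
# Sketch of audio-based alignment via DTW, assuming per-frame audio features
# (e.g., MFCCs); the feature choice and variable names are assumptions.
import numpy as np

def dtw_path(a: np.ndarray, b: np.ndarray):
    """Minimal-cost alignment path between feature sequences a (n, d)
    and b (m, d) using Euclidean frame distances."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j - 1],
                                 cost[i - 1, j],
                                 cost[i, j - 1])
    # Backtrack from (n, m) to recover the frame-to-frame mapping.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

# ref_feats / new_feats: per-frame audio features of the same sentence from
# the reference performance and a new subject (hypothetical arrays). For each
# pair (i, j) on the path, the animation parameters of reference frame i
# label the new subject's image at frame j.
```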
