Audio-driven facial animation by joint end-to-end learning of pose and emotion

We present a machine learning technique for driving 3D facial animation by audio input in real time and with low latency. Our deep neural network learns a mapping from input waveforms to the 3D vertex coordinates of a face model, and simultaneously discovers a compact, latent code that disambiguates the variations in facial expression that cannot be explained by the audio alone. During inference, the latent code can be used as an intuitive control for the emotional state of the face puppet. We train our network with 3–5 minutes of high-quality animation data obtained using traditional, vision-based performance capture methods. Even though our primary goal is to model the speaking style of a single actor, our model yields reasonable results even when driven with audio from other speakers with different gender, accent, or language, as we demonstrate with a user study. The results are applicable to in-game dialogue, low-cost localization, virtual reality avatars, and telepresence.
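
The abstract describes the model only at a high level: a network that maps a short audio window, together with a learned per-frame emotion code, to the vertex positions of a fixed-topology face mesh. The PyTorch sketch below is a minimal illustration of that idea under stated assumptions; the class name, layer sizes, feature dimensions (audio_feat_dim, window_len, emotion_dim), and audio features are all hypothetical and are not the paper's actual architecture.

    import torch
    import torch.nn as nn

    class AudioToFaceNet(nn.Module):
        """Hypothetical sketch: map an audio feature window plus a latent
        emotion code to 3D vertex positions of a fixed-topology face mesh.
        All dimensions are illustrative, not taken from the paper."""

        def __init__(self, n_vertices, audio_feat_dim=32, emotion_dim=16):
            super().__init__()
            # Convolutions over the time axis of the audio feature window.
            self.audio_net = nn.Sequential(
                nn.Conv1d(audio_feat_dim, 72, kernel_size=3, stride=2, padding=1),
                nn.ReLU(),
                nn.Conv1d(72, 108, kernel_size=3, stride=2, padding=1),
                nn.ReLU(),
                nn.Conv1d(108, 162, kernel_size=3, stride=2, padding=1),
                nn.ReLU(),
                nn.AdaptiveAvgPool1d(1),
            )
            # The emotion code is concatenated with the pooled audio features
            # before decoding to per-vertex 3D coordinates.
            self.decoder = nn.Sequential(
                nn.Linear(162 + emotion_dim, 150),
                nn.ReLU(),
                nn.Linear(150, n_vertices * 3),
            )

        def forward(self, audio_window, emotion_code):
            # audio_window: (batch, audio_feat_dim, window_len)
            # emotion_code: (batch, emotion_dim)
            x = self.audio_net(audio_window).squeeze(-1)
            x = torch.cat([x, emotion_code], dim=1)
            return self.decoder(x).view(audio_window.shape[0], -1, 3)

    # One plausible way to realize the jointly learned latent code: keep one
    # trainable emotion vector per training frame and optimize it together
    # with the network weights against the captured vertex data; at inference
    # time the user supplies the code directly as an expression control.
    num_training_frames, emotion_dim = 10000, 16
    emotion_codes = nn.Parameter(torch.zeros(num_training_frames, emotion_dim))

This is only a sketch of the input/output contract implied by the abstract; the actual network design, audio representation, and training objective are detailed in the full paper.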
