Speech-driven facial animation with spectral gathering and temporal attention

In this paper, we present an efficient algorithm that generates lip-synchronized facial animation from a given vocal audio clip. By combining a spectral-dimensional bidirectional long short-term memory (BiLSTM) with a temporal attention mechanism, we design a lightweight speech encoder that learns useful and robust vocal features from the input audio without resorting to pre-trained speech recognition modules or large training datasets. To learn subject-independent facial motion, we use deformation gradients as the internal representation, which allows nuanced local motions to be synthesized more faithfully than vertex offsets allow. Compared with state-of-the-art automatic-speech-recognition-based methods, our model is much smaller yet achieves comparable robustness and quality in most cases, and noticeably better results in certain challenging ones.
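Below is a minimal PyTorch sketch of the encoder idea described above: a bidirectional LSTM that runs along the spectral (frequency) axis of each mel-spectrogram frame to gather per-frame vocal features, followed by soft attention pooling over a temporal window. All module names, layer sizes, and the input layout are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpectralGatherEncoder(nn.Module):
    """Hypothetical sketch: spectral-axis BiLSTM + temporal attention."""

    def __init__(self, n_mels=80, hidden=64, feat_dim=128):
        super().__init__()
        # BiLSTM over the frequency bins of a single frame (spectral gathering).
        self.spectral_lstm = nn.LSTM(
            input_size=1, hidden_size=hidden,
            batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, feat_dim)
        # Scores each frame in the temporal window for soft attention pooling.
        self.attn = nn.Linear(feat_dim, 1)

    def forward(self, mel):                          # mel: (B, T, n_mels)
        B, T, M = mel.shape
        bins = mel.reshape(B * T, M, 1)              # each frame as a frequency sequence
        _, (h, _) = self.spectral_lstm(bins)         # h: (2, B*T, hidden)
        frame = h.transpose(0, 1).reshape(B * T, -1)  # concat both directions
        frame = self.proj(frame).reshape(B, T, -1)   # (B, T, feat_dim)
        # Temporal attention: weight frames in the window, pool to one vector.
        w = F.softmax(self.attn(frame), dim=1)       # (B, T, 1)
        return (w * frame).sum(dim=1)                # (B, feat_dim)
```

The deformation-gradient representation can be sketched similarly. Assuming the standard per-triangle construction from deformation transfer (local frames built from two edge vectors and a scale-compensated normal, so the frame is invertible), the gradient is the linear map carrying the rest-pose frame onto the deformed one; this is a hedged illustration, not necessarily the paper's exact formulation.

```python
import numpy as np

def deformation_gradient(rest, deformed):
    """rest, deformed: (3, 3) arrays of triangle vertex positions (one per row)."""
    def frame(tri):
        e1, e2 = tri[1] - tri[0], tri[2] - tri[0]
        n = np.cross(e1, e2)
        n /= np.sqrt(np.linalg.norm(n))      # scale-compensated normal direction
        return np.column_stack([e1, e2, n])  # 3x3 local frame
    # F maps the rest-pose frame onto the deformed frame.
    return frame(deformed) @ np.linalg.inv(frame(rest))
```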
