Speech-driven facial animation with spectral gathering and temporal attention

In this paper, we present an efficient algorithm that generates lip-synchronized facial animation from a given vocal audio clip. By combining a spectral-dimensional bidirectional long short-term memory (BiLSTM) with a temporal attention mechanism, we design a lightweight speech encoder that learns useful and robust vocal features from the input audio without resorting to pre-trained speech recognition modules or large training datasets. To learn subject-independent facial motion, we use deformation gradients as the internal representation, which allows nuanced local motions to be synthesized more faithfully than vertex offsets allow. Compared with state-of-the-art automatic-speech-recognition-based methods, our model is much smaller yet achieves comparable robustness and quality in most cases, and noticeably better results in certain challenging ones.
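Below is a minimal PyTorch sketch of the encoder idea described above: a bidirectional LSTM that runs along the spectral (frequency) axis of each mel-spectrogram frame to gather per-frame vocal features, followed by soft attention pooling over a temporal window. All module names, layer sizes, and the input layout are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpectralGatherEncoder(nn.Module):
    """Hypothetical sketch: spectral-axis BiLSTM + temporal attention."""

    def __init__(self, n_mels=80, hidden=64, feat_dim=128):
        super().__init__()
        # BiLSTM over the frequency bins of a single frame (spectral gathering).
        self.spectral_lstm = nn.LSTM(
            input_size=1, hidden_size=hidden,
            batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, feat_dim)
        # Scores each frame in the temporal window for soft attention pooling.
        self.attn = nn.Linear(feat_dim, 1)

    def forward(self, mel):                          # mel: (B, T, n_mels)
        B, T, M = mel.shape
        bins = mel.reshape(B * T, M, 1)              # each frame as a frequency sequence
        _, (h, _) = self.spectral_lstm(bins)         # h: (2, B*T, hidden)
        frame = h.transpose(0, 1).reshape(B * T, -1)  # concat both directions
        frame = self.proj(frame).reshape(B, T, -1)   # (B, T, feat_dim)
        # Temporal attention: weight frames in the window, pool to one vector.
        w = F.softmax(self.attn(frame), dim=1)       # (B, T, 1)
        return (w * frame).sum(dim=1)                # (B, feat_dim)
```

The deformation-gradient representation can be sketched similarly. Assuming the standard per-triangle construction from deformation transfer (local frames built from two edge vectors and a scale-compensated normal, so the frame is invertible), the gradient is the linear map carrying the rest-pose frame onto the deformed one; this is a hedged illustration, not necessarily the paper's exact formulation.

```python
import numpy as np

def deformation_gradient(rest, deformed):
    """rest, deformed: (3, 3) arrays of triangle vertex positions (one per row)."""
    def frame(tri):
        e1, e2 = tri[1] - tri[0], tri[2] - tri[0]
        n = np.cross(e1, e2)
        n /= np.sqrt(np.linalg.norm(n))      # scale-compensated normal direction
        return np.column_stack([e1, e2, n])  # 3x3 local frame
    # F maps the rest-pose frame onto the deformed frame.
    return frame(deformed) @ np.linalg.inv(frame(rest))
```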
