Speaker-Independent Speech-Driven Visual Speech Synthesis using Domain-Adapted Acoustic Models

Speech-driven visual speech synthesis maps acoustic speech features to the corresponding lip-animation controls of a face model. This mapping can take many forms, but a powerful approach is to use deep neural networks (DNNs). A key obstacle to training such DNNs reliably, especially for speaker-independent models, is the scarcity of synchronized audio, video, and depth data. In this paper, we investigate adapting an automatic speech recognition (ASR) acoustic model (AM) to the visual speech synthesis problem. We first train the ASR-AM on ten thousand hours of audio-only transcribed speech, and then adapt it to the visual speech synthesis domain using ninety hours of synchronized audio-visual speech. In a subjective assessment test, we compared the AM-initialized DNN against a randomly initialized model. The results show that viewers significantly prefer animations generated by the AM-initialized DNN over those generated by the randomly initialized model. We conclude that visual speech synthesis can benefit substantially from the powerful speech representations learned by ASR acoustic models.
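The approach described above is essentially a transfer-learning recipe: pretrain a DNN as an ASR acoustic model on the large transcribed-audio corpus, then reuse its hidden layers, replace the senone classification head with a regression head that predicts animation controls, and fine-tune on the smaller synchronized audio-visual corpus. Below is a minimal PyTorch sketch of that recipe; the layer sizes, feature and control dimensions, and all variable names are illustrative assumptions, not the paper's actual configuration.

    # Minimal sketch: initialize a visual speech synthesis network from a
    # pretrained ASR acoustic model (AM), then swap the classification head
    # for a regression head over face-animation controls. All sizes and
    # names below are assumptions for illustration, not the paper's setup.
    import torch
    import torch.nn as nn

    NUM_FEATS = 40        # e.g. 40-dim filterbank features per frame (assumption)
    NUM_SENONES = 6000    # tied-state ASR targets (assumption)
    NUM_CONTROLS = 50     # face-rig animation parameters per frame (assumption)

    def make_trunk():
        # Shared hidden layers; the paper's actual topology is not given here.
        return nn.Sequential(
            nn.Linear(NUM_FEATS, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
        )

    # 1) ASR acoustic model: trunk + senone classification head, trained
    #    on the large audio-only transcribed corpus (pretraining omitted).
    asr_am = nn.Sequential(make_trunk(), nn.Linear(1024, NUM_SENONES))

    # 2) Visual speech model: reuse the pretrained trunk, attach a fresh
    #    regression head, and fine-tune on the audio-visual corpus.
    vss_model = nn.Sequential(asr_am[0], nn.Linear(1024, NUM_CONTROLS))

    optimizer = torch.optim.Adam(vss_model.parameters(), lr=1e-4)
    loss_fn = nn.MSELoss()  # frame-wise regression to animation controls

    def adaptation_step(feats, controls):
        """One fine-tuning step on (frames x NUM_FEATS, frames x NUM_CONTROLS)."""
        optimizer.zero_grad()
        loss = loss_fn(vss_model(feats), controls)
        loss.backward()
        optimizer.step()
        return loss.item()

The design intuition is that the trunk, having learned a robust, speaker-independent representation of speech from ten thousand hours of audio, gives the animation regressor a far better starting point than random initialization, which is what the subjective test measures.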
