Audiovisual Speech Synthesis using Tacotron2

Audiovisual speech synthesis is the task of synthesizing a talking face while maximizing the coherence between the acoustic and visual speech. To address this problem, we propose AVTacotron2, an end-to-end text-to-audiovisual speech synthesizer based on the Tacotron2 architecture. AVTacotron2 converts a sequence of phonemes into a sequence of acoustic features and the corresponding controllers of a face model. The output acoustic features are passed through a WaveRNN vocoder to reconstruct the speech waveform, and the waveform together with the predicted facial controllers is used to render the video of the talking face. As a baseline, we use a modular system in which acoustic speech is synthesized from text using the standard Tacotron2, and the reconstructed speech then drives the face-model controls through an independently trained audio-to-facial-animation neural network. We further condition both the end-to-end and modular approaches on emotion embeddings that encode the prosody required to generate emotional audiovisual speech. A comprehensive analysis shows that the end-to-end system synthesizes close to human-like audiovisual speech, with a mean opinion score (MOS) of 4.1, matching the MOS obtained on ground truth generated from professionally recorded videos.
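
To make the end-to-end idea concrete, the sketch below shows one way a Tacotron2-style decoder step could emit both an acoustic frame and face-model controllers from a shared hidden state. It is a minimal illustration only: the module names, layer sizes, and dimensions (80 mel bins, 51 controller values, the omitted prenet and attention module) are assumptions for the example and are not taken from the paper.

```python
import torch
import torch.nn as nn


class AudiovisualFrameDecoder(nn.Module):
    """Sketch of a Tacotron2-style decoder step with an extra visual head.

    One recurrent step consumes the previous acoustic frame and the current
    attention context, then projects the shared state to (a) a mel frame for
    the vocoder, (b) face-model controller values, and (c) a stop token.
    """

    def __init__(self, encoder_dim=512, decoder_dim=1024,
                 n_mel=80, n_face_controls=51):
        super().__init__()
        # Recurrent core (prenet and attention are omitted for brevity).
        self.rnn = nn.LSTMCell(encoder_dim + n_mel, decoder_dim)
        # Acoustic head: one mel-spectrogram frame per decoder step.
        self.mel_proj = nn.Linear(decoder_dim + encoder_dim, n_mel)
        # Visual head: face-model controller values for the same step
        # (e.g., blendshape-style coefficients; the count is illustrative).
        self.face_proj = nn.Linear(decoder_dim + encoder_dim, n_face_controls)
        # Stop-token head, as in Tacotron2.
        self.gate_proj = nn.Linear(decoder_dim + encoder_dim, 1)

    def forward(self, prev_mel, attn_context, state):
        h, c = self.rnn(torch.cat([prev_mel, attn_context], dim=-1), state)
        features = torch.cat([h, attn_context], dim=-1)
        mel_frame = self.mel_proj(features)    # -> WaveRNN vocoder
        face_frame = self.face_proj(features)  # -> face-model renderer
        stop_logit = self.gate_proj(features)
        return mel_frame, face_frame, stop_logit, (h, c)


if __name__ == "__main__":
    batch = 2
    decoder = AudiovisualFrameDecoder()
    prev_mel = torch.zeros(batch, 80)
    context = torch.zeros(batch, 512)
    state = (torch.zeros(batch, 1024), torch.zeros(batch, 1024))
    mel, face, stop, state = decoder(prev_mel, context, state)
    print(mel.shape, face.shape, stop.shape)  # (2, 80) (2, 51) (2, 1)
```

Because both heads read the same decoder state at every step, the acoustic and visual streams stay frame-synchronized by construction, which is the property the end-to-end system relies on; the emotion conditioning described above could be added by concatenating an emotion embedding to the attention context.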
