Synthesising visual speech using dynamic visemes and deep learning architectures

This paper proposes and compares a range of methods for improving the naturalness of visual speech synthesis. We consider a feedforward deep neural network (DNN) and both many-to-one and many-to-many recurrent neural networks (RNNs) built on long short-term memory (LSTM). Rather than using acoustically derived units of speech, such as phonemes, we adopt viseme representations and propose combining dynamic visemes with a deep learning framework. We also investigate the input feature representation and find that wide phoneme and viseme contexts are crucial for predicting realistic lip motions that are sufficiently smooth yet not under-articulated. A detailed objective evaluation across a range of system configurations shows that a combined dynamic viseme-phoneme speech unit, used with a many-to-many encoder-decoder architecture, models visual co-articulation effectively. Subjective preference tests reveal no significant difference between animations produced by this system and those driven by ground-truth facial motion taken from the original video. Furthermore, the dynamic viseme system significantly outperforms conventional phoneme-driven speech animation systems.
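
The paper itself provides no code, so the following is a minimal, hypothetical sketch of how such a wide-context input representation might be assembled: the per-frame phoneme and dynamic viseme labels are each one-hot encoded over a window of neighbouring frames and concatenated. The toy inventories, the window width and the helper names (one_hot, context_window, frame_features) are all invented for illustration.

    import numpy as np

    # Hypothetical label inventories; the actual phoneme set and dynamic
    # viseme clusters used in the paper are not reproduced here.
    PHONEMES = ["sil", "p", "b", "m", "f", "v"]   # toy subset only
    VISEMES = ["dv0", "dv1", "dv2", "dv3"]        # toy dynamic viseme classes

    def one_hot(symbol, inventory):
        """Encode a single label as a one-hot vector over its inventory."""
        vec = np.zeros(len(inventory), dtype=np.float32)
        vec[inventory.index(symbol)] = 1.0
        return vec

    def context_window(labels, t, width, inventory):
        """Stack one-hot encodings of the labels in a +/- `width` frame
        window around frame `t`, clamping indices at the sequence edges."""
        frames = [labels[min(max(t + k, 0), len(labels) - 1)]
                  for k in range(-width, width + 1)]
        return np.concatenate([one_hot(f, inventory) for f in frames])

    def frame_features(phone_labels, viseme_labels, t, width=5):
        """Wide-context input for one frame: the phoneme context and the
        dynamic viseme context, concatenated into a single vector."""
        return np.concatenate([
            context_window(phone_labels, t, width, PHONEMES),
            context_window(viseme_labels, t, width, VISEMES),
        ])

    # Example: features for frame 3 of a toy labelled sequence.
    phones = ["sil", "p", "b", "m", "f", "v", "sil"]
    visemes = ["dv0", "dv1", "dv1", "dv2", "dv2", "dv3", "dv0"]
    x = frame_features(phones, visemes, t=3)   # shape (110,)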

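The strongest objective results are reported for the many-to-many encoder-decoder configuration, so a minimal sketch of that style of model may help fix ideas. It is written in PyTorch purely for illustration; the hidden size, the 30-dimensional visual (e.g. AAM) parameter output and the teacher-forced training fragment are assumptions rather than details taken from the paper.

    import torch
    import torch.nn as nn

    class Seq2SeqVisualSpeech(nn.Module):
        """Many-to-many encoder-decoder: a sequence of wide-context
        linguistic feature vectors in, a sequence of visual speech
        parameter vectors out. All sizes are illustrative only."""

        def __init__(self, in_dim, hid_dim=256, out_dim=30):
            super().__init__()
            self.encoder = nn.LSTM(in_dim, hid_dim, batch_first=True)
            self.decoder = nn.LSTM(out_dim, hid_dim, batch_first=True)
            self.proj = nn.Linear(hid_dim, out_dim)

        def forward(self, x, targets):
            # Encode the whole linguistic sequence into a context state.
            _, state = self.encoder(x)
            # Teacher forcing: condition the decoder on shifted targets.
            batch, steps, out_dim = targets.shape
            start = torch.zeros(batch, 1, out_dim)
            dec_in = torch.cat([start, targets[:, :-1]], dim=1)
            dec_out, _ = self.decoder(dec_in, state)
            return self.proj(dec_out)

    # Toy example: 10 sequences of 100 video frames, with 110-dimensional
    # inputs matching the wide-context feature sketch above.
    model = Seq2SeqVisualSpeech(in_dim=110)
    feats = torch.randn(10, 100, 110)   # wide-context linguistic features
    params = torch.randn(10, 100, 30)   # visual speech parameter targets
    pred = model(feats, params)
    loss = nn.functional.mse_loss(pred, params)   # e.g. trained with MSE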