A deep learning approach for generalized speech animation

We introduce a simple and effective deep learning approach to automatically generate natural-looking speech animation synchronized to input speech. Our approach uses a sliding-window predictor that learns arbitrary nonlinear mappings from input sequences of phoneme labels to mouth movements in a way that accurately captures natural motion and visual coarticulation effects. Our deep learning approach enjoys several attractive properties: it runs in real time, requires minimal parameter tuning, generalizes well to novel input speech sequences, is easily edited to create stylized and emotional speech, and is compatible with existing animation retargeting approaches. One important focus of our work is to develop an effective approach for speech animation that can be easily integrated into existing production pipelines. We provide a detailed description of our end-to-end approach, including machine learning design decisions. Generalized speech animation results are demonstrated over a wide range of animation clips on a variety of characters and voices, including singing and foreign-language input. Our approach can also generate on-demand speech animation in real time from user speech input.
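To make the sliding-window idea concrete, the following is a minimal sketch in PyTorch of the kind of predictor the abstract describes: a network that maps a window of phoneme labels to a window of mouth-shape parameters, with overlapping predictions averaged into one smooth trajectory. The window sizes, layer widths, phoneme inventory, and use of AAM-style output parameters are illustrative assumptions, not the authors' exact configuration.

```python
# Sketch of a sliding-window phoneme-to-animation predictor.
# All sizes below are assumptions for illustration.
import torch
import torch.nn as nn

NUM_PHONEMES = 41    # assumed phoneme inventory size
IN_WIN = 11          # assumed input window length (phoneme frames)
OUT_WIN = 5          # assumed output window length (animation frames)
NUM_PARAMS = 30      # assumed visual (e.g. AAM-style) parameters per frame

class SlidingWindowPredictor(nn.Module):
    """Maps a window of one-hot phoneme labels to a window of mouth-shape
    parameters; each prediction sees surrounding phonetic context, which is
    what lets the model capture coarticulation."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(IN_WIN * NUM_PHONEMES, 2000), nn.Tanh(),
            nn.Linear(2000, 2000), nn.Tanh(),
            nn.Linear(2000, OUT_WIN * NUM_PARAMS),
        )

    def forward(self, x):      # x: (batch, IN_WIN * NUM_PHONEMES)
        return self.net(x)     # (batch, OUT_WIN * NUM_PARAMS)

def synthesize(model, phoneme_frames):
    """Slide the input window over a frame-level phoneme sequence and
    average the overlapping output windows into one parameter track."""
    idx = torch.as_tensor(phoneme_frames)
    T = idx.shape[0]
    onehot = torch.zeros(T, NUM_PHONEMES)
    onehot[torch.arange(T), idx] = 1.0
    track = torch.zeros(T, NUM_PARAMS)
    counts = torch.zeros(T, 1)
    with torch.no_grad():
        for t in range(T - IN_WIN + 1):
            window = onehot[t:t + IN_WIN].reshape(1, -1)
            out = model(window).reshape(OUT_WIN, NUM_PARAMS)
            c = t + (IN_WIN - OUT_WIN) // 2   # center-align output window
            track[c:c + OUT_WIN] += out
            counts[c:c + OUT_WIN] += 1
    # clamp avoids division by zero at the sequence boundaries, which the
    # centered output window does not cover in this simplified sketch
    return track / counts.clamp(min=1.0)
```

Because every animation frame is the average of several overlapping predictions, each conditioned on a different phonetic context, the resulting parameter track is smooth and reflects neighboring phonemes, which is the coarticulation behavior the abstract highlights.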
