Noise-Resilient Training Method for Face Landmark Generation From Speech

Visual cues such as lip movements, when available, play an important role in speech communication; they are especially helpful for the hearing-impaired population and in noisy environments. When such cues are not available, a system that automatically generates talking faces in sync with input speech would enhance speech communication and enable many novel applications. In this article, we present a new system that generates 3D talking face landmarks from speech in an online fashion. We employ a neural network that accepts the raw waveform as input; the network contains convolutional layers with 1D kernels and outputs the active shape model (ASM) coefficients of the face landmarks. To promote smoother transitions between video frames, we present a variant of the model that has the same architecture but also accepts the previous frame's ASM coefficients as an additional input. To cope with background noise, we propose a new training method that incorporates speech-enhancement ideas at the feature level. Objective evaluations of landmark prediction show that the proposed system yields statistically significantly smaller errors than two state-of-the-art baseline methods on both a single-speaker dataset and a multi-speaker dataset. Experiments on noisy speech input with five types of non-stationary, unseen noise show statistically significant improvements in system performance thanks to the noise-resilient training method. Finally, subjective evaluations show that the generated talking faces match the input audio significantly more convincingly, achieving a level of realism comparable to that of the ground-truth landmarks.
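The abstract describes the architecture only at a high level: 1D convolutions over the raw waveform, an output of ASM coefficients, and an optional conditioning on the previous frame's coefficients. The following is a minimal PyTorch sketch of that description; the layer counts, channel widths, kernel sizes, window length, and the number of ASM coefficients are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class WaveformToASM(nn.Module):
    """Maps a raw-waveform window to ASM coefficients for one video frame.

    Layer counts, channel widths, kernel sizes, and the number of ASM
    coefficients (n_asm) are illustrative guesses, not the paper's values.
    """
    def __init__(self, n_asm: int = 20, condition_on_prev: bool = False):
        super().__init__()
        self.condition_on_prev = condition_on_prev
        self.encoder = nn.Sequential(
            # 1D convolutions over the raw waveform: (batch, 1, samples)
            nn.Conv1d(1, 32, kernel_size=25, stride=4), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=15, stride=4), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=9, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # collapse time axis -> (batch, 128, 1)
        )
        in_dim = 128 + (n_asm if condition_on_prev else 0)
        self.head = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, n_asm),  # predicted ASM coefficients
        )

    def forward(self, wav: torch.Tensor, prev_asm: torch.Tensor = None):
        h = self.encoder(wav.unsqueeze(1)).squeeze(-1)  # (batch, 128)
        if self.condition_on_prev:
            # Conditioning on the previous frame's coefficients promotes
            # smoother transitions between consecutive video frames.
            h = torch.cat([h, prev_asm], dim=1)
        return self.head(h)

# Example: one 40 ms window at 16 kHz (640 samples) per video frame.
model = WaveformToASM(n_asm=20, condition_on_prev=True)
wav = torch.randn(8, 640)
prev = torch.zeros(8, 20)
coeffs = model(wav, prev)  # shape: (8, 20)
```

Because the model accepts a short waveform window and (optionally) only the previous frame's coefficients, it can run frame by frame, which matches the online generation the abstract claims.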

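The noise-resilient training is likewise described only as incorporating "speech enhancement ideas at the feature level." One plausible realization, sketched below under stated assumptions rather than as the authors' exact method, is a feature-matching term that pulls the encoder's noisy-speech features toward its clean-speech features while the landmark loss is computed from the noisy input. The weighting `lam` is a hypothetical hyperparameter, and the sketch assumes the non-autoregressive variant of the model above (condition_on_prev=False).

```python
import torch
import torch.nn.functional as F

def noise_resilient_step(model, clean_wav, noisy_wav, target_asm, lam=1.0):
    """One training step with a feature-level denoising term (illustrative).

    Assumes `model` is the WaveformToASM sketch above with
    condition_on_prev=False; `lam` is a hypothetical loss weight.
    """
    feat_clean = model.encoder(clean_wav.unsqueeze(1)).squeeze(-1)
    feat_noisy = model.encoder(noisy_wav.unsqueeze(1)).squeeze(-1)

    # Predict ASM coefficients from the noisy-speech features.
    pred = model.head(feat_noisy)
    task_loss = F.mse_loss(pred, target_asm)

    # Pull noisy-speech features toward clean-speech features so the
    # encoder learns to suppress background noise internally.
    enhance_loss = F.mse_loss(feat_noisy, feat_clean.detach())
    return task_loss + lam * enhance_loss
```

Training with paired clean/noisy waveforms in this way needs no separate enhancement front end at test time, which is consistent with the abstract's reported robustness to unseen non-stationary noise.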