Speech-Driven 3D Facial Animation with Implicit Emotional Awareness: A Deep Learning Approach

We introduce a long short-term memory recurrent neural network (LSTM-RNN) approach for real-time facial animation that automatically estimates a speaker's head rotation and facial action unit activations from her speech alone. Specifically, the time-varying, contextual, non-linear mapping from the audio stream to visual facial movements is learned by training an LSTM network on a large audio-visual data corpus. We extract a set of acoustic features from the input audio, including the Mel-scaled spectrogram, Mel-frequency cepstral coefficients (MFCCs), and the chromagram, which together capture both the contextual progression and the emotional intensity of the speech. Output facial movements are parameterized as a 3D head rotation and the expression blending weights of a blendshape model, and can therefore be used directly for animation. Thus, even though our model does not explicitly predict the affective state of the target speaker, her emotional manifestation is recreated through the expression weights of the face model. Experiments on an evaluation dataset of different speakers across a wide range of affective states demonstrate promising results for real-time speech-driven facial animation.
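
To make the described mapping concrete, the following is a minimal sketch of one plausible realization, assuming librosa for the three acoustic features and PyTorch for the LSTM; the frame hop, layer sizes, 46-blendshape output dimension, and input file name are illustrative assumptions, not the paper's reported configuration.

```python
# Illustrative sketch of a speech-to-face pipeline: acoustic feature extraction
# (Mel spectrogram + MFCCs + chromagram) feeding an LSTM that predicts head
# rotation and blendshape expression weights per audio frame.
import librosa
import numpy as np
import torch
import torch.nn as nn

def extract_features(wav_path, sr=16000, hop=160):
    """Stack the three acoustic features into per-frame vectors of shape (frames, dims)."""
    y, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40, hop_length=hop))
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr, hop_length=hop)
    feats = np.concatenate([mel, mfcc, chroma], axis=0)  # (n_dims, n_frames)
    return feats.T.astype(np.float32)                    # (n_frames, n_dims)

class SpeechToFace(nn.Module):
    """LSTM mapping audio features to 3D head rotation and blendshape weights."""
    def __init__(self, n_features, hidden=256, n_blendshapes=46):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, num_layers=2, batch_first=True)
        self.rotation = nn.Linear(hidden, 3)             # e.g., Euler angles
        self.expression = nn.Linear(hidden, n_blendshapes)

    def forward(self, x):                                # x: (batch, frames, n_dims)
        h, _ = self.lstm(x)
        rot = self.rotation(h)
        weights = torch.sigmoid(self.expression(h))      # blending weights in [0, 1]
        return rot, weights

feats = extract_features("speech.wav")                   # hypothetical input file
model = SpeechToFace(n_features=feats.shape[1])
rotation, blend_weights = model(torch.from_numpy(feats).unsqueeze(0))
```

The sigmoid on the expression head keeps the predicted blending weights in the [0, 1] range expected by a blendshape rig, while the rotation head is left unconstrained; both outputs can drive the face model directly, frame by frame.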
