Identity-Preserving Realistic Talking Face Generation

Speech-driven facial animation is useful for a variety of applications, such as telepresence and chatbots. A realistic facial animation requires (1) audio-visual synchronization, (2) preservation of the target individual's identity, (3) plausible mouth movements, and (4) natural eye blinks. Existing methods mostly address audio-visual lip synchronization, and only a few recent works synthesize natural eye blinks for overall video realism. In this paper, we propose a method for identity-preserving realistic facial animation from speech. We first generate person-independent facial landmarks from audio using DeepSpeech features, which provide invariance to different voices, accents, etc. To add realism, we impose eye blinks on the facial landmarks using unsupervised learning, and we retarget the person-independent landmarks to person-specific landmarks to preserve the identity-related facial structure, which helps in generating plausible mouth shapes for the target identity. Finally, we use LSGAN to generate the facial texture from the person-specific facial landmarks, with an attention mechanism that helps preserve identity-related texture. An extensive comparison of our proposed method with current state-of-the-art methods demonstrates a significant improvement in lip synchronization accuracy, image reconstruction quality, sharpness, and identity preservation. A user study also reveals improved realism of our animation results over the state-of-the-art methods. To the best of our knowledge, this is the first work on speech-driven 2D facial animation that simultaneously addresses all of the above attributes of a realistic speech-driven facial animation.
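
For concreteness, below is a minimal sketch of the four pipeline stages described above, assuming a PyTorch setting. The module names (SpeechToLandmarks, impose_blinks, retarget, LandmarkToTexture), feature dimensions, and layer choices are illustrative assumptions only, not the paper's actual architecture or training setup.

```python
# Illustrative sketch of the speech-to-face pipeline; all names, dimensions,
# and layers are assumptions, not the paper's actual networks.
import torch
import torch.nn as nn


class SpeechToLandmarks(nn.Module):
    """Stage 1: person-independent facial landmarks from DeepSpeech audio features."""
    def __init__(self, audio_dim=29, n_landmarks=68, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(audio_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_landmarks * 2)

    def forward(self, audio_feats):                  # (B, T, audio_dim)
        h, _ = self.rnn(audio_feats)
        return self.head(h)                          # (B, T, 136) x/y coordinates


def impose_blinks(landmarks, blink_signal, eyelid_idx, amplitude=0.05):
    """Stage 2 (illustrative): perturb eyelid landmarks with a blink signal,
    e.g. one sampled from a model learned without blink annotations."""
    out = landmarks.clone()
    out[..., eyelid_idx] += amplitude * blink_signal  # vertical eyelid displacement
    return out


def retarget(person_independent, target_mean_shape, source_mean_shape):
    """Stage 3 (illustrative): transfer speech-driven landmark displacements
    onto the target identity's facial structure."""
    return target_mean_shape + (person_independent - source_mean_shape)


class LandmarkToTexture(nn.Module):
    """Stage 4: LSGAN-style generator with an attention map, so identity-related
    texture is copied from the reference frame rather than re-synthesized."""
    def __init__(self, in_ch=3 + 1):                 # reference image + landmark heatmap
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        )
        self.colour = nn.Conv2d(64, 3, 3, padding=1)
        self.attn = nn.Conv2d(64, 1, 3, padding=1)

    def forward(self, ref_img, landmark_map):        # (B, 3, H, W), (B, 1, H, W)
        feats = self.backbone(torch.cat([ref_img, landmark_map], dim=1))
        colour = torch.tanh(self.colour(feats))
        attn = torch.sigmoid(self.attn(feats))       # ~1 where new texture is generated
        return attn * colour + (1.0 - attn) * ref_img


# Shape check only: 29-D DeepSpeech features for a 2-second clip at 25 fps.
audio = torch.randn(1, 50, 29)
landmarks = SpeechToLandmarks()(audio)               # (1, 50, 136)
```

The attention-blended output, attn * colour + (1 - attn) * ref_img, reflects the idea that static, identity-related texture can be copied directly from the reference frame while only the speech-dependent regions (mouth, eyes) are synthesized.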
