Talking Face Generation with Expression-Tailored Generative Adversarial Network

A key to automatically generating vivid talking faces is synthesizing identity-preserving, natural facial expressions beyond audio-lip synchronization, which usually requires disentangling informative features from multiple modalities and then fusing them. In this paper, we propose an end-to-end Expression-Tailored Generative Adversarial Network (ET-GAN) that generates expression-rich talking face videos for arbitrary identities. Unlike talking face generation driven only by an identity image and audio, our approach uses an expressional video of an arbitrary identity as the expression source. An expression encoder disentangles an expression-tailored representation from the guiding expressional video, while an audio encoder disentangles an audio-lip representation. Instead of using a single image as the identity input, a multi-image identity encoder learns different views of the face and merges them into a unified representation. Multiple discriminators preserve both image-level and video-level realism, including a spatial-temporal discriminator that enforces visual continuity of expression synthesis and facial movements. We conduct extensive experimental evaluations of quantitative metrics, expression retention quality, and audio-visual synchronization. The results demonstrate the effectiveness of ET-GAN in generating high-quality expressional talking face videos compared with existing state-of-the-art methods.
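The fusion scheme the abstract describes can be sketched in a few lines: three encoders produce an expression-tailored code, an audio-lip code, and a merged multi-view identity code, which are then fused per frame before generation. The following is a minimal numpy sketch under stated assumptions; all names, dimensions, and the random-projection "encoders" are illustrative stand-ins for the paper's learned CNN encoders, not the actual ET-GAN implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(features, dim):
    # Stand-in encoder: a fixed random projection to a `dim`-sized
    # embedding (hypothetical; ET-GAN's encoders are learned networks).
    w = rng.standard_normal((features.shape[-1], dim))
    return features @ w

# Toy inputs: an expressional video (T frames), aligned audio features,
# and K identity images of different views.
T, K = 16, 4
expr_video = rng.standard_normal((T, 512))   # per-frame visual features
audio_clip = rng.standard_normal((T, 128))   # per-frame audio features
id_images  = rng.standard_normal((K, 512))   # K views of the identity

d = 64
z_expr  = encode(expr_video, d)              # expression-tailored representation
z_audio = encode(audio_clip, d)              # audio-lip representation
z_id    = encode(id_images, d).mean(axis=0)  # merge K views into one identity code

# Fuse per frame: broadcast the identity code over time, then concatenate
# the three disentangled representations into one generator input.
z = np.concatenate([z_expr, z_audio, np.tile(z_id, (T, 1))], axis=1)
print(z.shape)  # one fused latent per output frame: (16, 192)
```

In the actual model the fused latent would feed a video generator trained against the image-level and spatial-temporal discriminators; averaging the per-view identity embeddings is one simple way to realize the "merging a unified representation" step.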
