Audio-driven Talking Face Video Generation with Learning-based Personalized Head Pose

Real-world talking faces are often accompanied by natural head movement. However, most existing talking face video generation methods only consider facial animation with a fixed head pose. In this paper, we address this problem by proposing a deep neural network model that takes an audio signal A of a source person and a very short video V of a target person as input, and outputs a synthesized high-quality talking face video with personalized head pose (making use of the visual information in V), expression and lip synchronization (by considering both A and V). The most challenging issue in our work is that natural poses often cause in-plane and out-of-plane head rotations, which make synthesized talking face videos far from realistic. To address this challenge, we reconstruct a 3D face animation and re-render it into synthesized frames. To fine-tune these frames into realistic ones with smooth background transitions, we propose a novel memory-augmented GAN module. By first training a general mapping on a publicly available dataset and then fine-tuning the mapping using the input short video of the target person, we develop an effective strategy that requires only a small number of frames (about 300) to learn personalized talking behavior, including head pose. Extensive experiments and two user studies show that our method can generate high-quality talking face videos (i.e., with personalized head movements, expressions and good lip synchronization), which look natural and exhibit more distinctive head movements than those of state-of-the-art methods.
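
The distinction between in-plane and out-of-plane head rotations can be illustrated with a small geometric sketch. This is not the paper's model: the landmark coordinates, the function names, the Euler-angle convention and the orthographic projection are all assumptions chosen for illustration. The sketch rotates three hypothetical 3D face landmarks and compares the projected distance between the eye corners under an in-plane roll and an out-of-plane yaw.

```python
import math

# Hypothetical 3D landmarks (arbitrary units): left eye corner, right eye corner, nose tip.
LANDMARKS = [(-1.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, -0.5, 1.0)]


def matmul(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def rotation_matrix(yaw, pitch, roll):
    """Return R = Rz(roll) * Ry(yaw) * Rx(pitch); angles in radians (assumed convention)."""
    cy, sy = math.cos(yaw), math.sin(yaw)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cr, sr = math.cos(roll), math.sin(roll)
    rx = [[1, 0, 0], [0, cp, -sp], [0, sp, cp]]   # pitch: rotation about the x axis
    ry = [[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]]   # yaw: rotation about the y axis
    rz = [[cr, -sr, 0], [sr, cr, 0], [0, 0, 1]]   # roll: rotation about the viewing axis z
    return matmul(rz, matmul(ry, rx))


def project(points, yaw=0.0, pitch=0.0, roll=0.0):
    """Rotate 3D points, then drop z (orthographic projection onto the image plane)."""
    r = rotation_matrix(yaw, pitch, roll)
    return [(sum(r[0][k] * p[k] for k in range(3)),
             sum(r[1][k] * p[k] for k in range(3))) for p in points]


def distance(a, b):
    """Euclidean distance between two 2D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


if __name__ == "__main__":
    angle = math.radians(30)
    eyes = LANDMARKS[:2]
    print("frontal:", round(distance(*project(eyes)), 3))
    print("roll 30 deg (in-plane):", round(distance(*project(eyes, roll=angle)), 3))
    print("yaw 30 deg (out-of-plane):", round(distance(*project(eyes, yaw=angle)), 3))
```

In this toy setting, the roll leaves the projected eye distance unchanged, so it could be reproduced by rotating the 2D image, whereas the yaw foreshortens it by a factor of cos(30°). Out-of-plane rotations therefore change the apparent face geometry in a way that a purely 2D transformation of a fixed-pose frame cannot capture, which is consistent with the abstract's motivation for reconstructing and re-rendering a 3D face.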
