Talking Head from Speech Audio using a Pre-trained Image Generator

We propose a novel method for generating high-resolution talking-head videos from speech audio and a single 'identity' image. Our method is based on a convolutional neural network model that incorporates a pre-trained StyleGAN generator. We model each frame as a point in the latent space of StyleGAN, so that a video corresponds to a trajectory through the latent space. The network is trained in two stages. In the first stage, we model trajectories in the latent space conditioned on speech utterances. To do this, we use an existing encoder to invert the generator, mapping each video frame into the latent space, and we train a recurrent neural network to map speech utterances to displacements in the latent space of the image generator. These displacements are relative to the back-projected latent code of an identity image chosen from the individuals depicted in the training dataset. In the second stage, we improve the visual quality of the generated videos by fine-tuning the image generator on a single image or a short video of any chosen identity. We evaluate our model on standard measures (PSNR, SSIM, FID, and LMD) and show that it significantly outperforms recent state-of-the-art methods on one of two commonly used datasets and gives comparable performance on the other. Finally, we report ablation experiments that validate the components of the model. Code and videos from our experiments are available at https://mohammedalghamdi.github.io/talking-heads-acm-mm/

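To make the latent-trajectory idea concrete, below is a minimal sketch of the first stage in PyTorch. It is an illustration under stated assumptions, not the authors' implementation: the module names, feature dimensions, and the encoder and stylegan_generator calls in the usage comments are hypothetical stand-ins for a pre-trained GAN-inversion encoder and a StyleGAN synthesis network, and the audio input is assumed to be per-frame mel-spectrogram features.

    # Sketch of stage one: regress speech-driven displacements in StyleGAN's
    # latent space, relative to the latent code of an inverted identity image.
    # All names and shapes are illustrative assumptions, not the authors' code.
    import torch
    import torch.nn as nn

    class AudioToLatentRNN(nn.Module):
        """Maps a sequence of per-frame audio features to displacements in the
        StyleGAN latent space, relative to an identity image's latent code."""

        def __init__(self, audio_dim=80, hidden_dim=512, latent_dim=512):
            super().__init__()
            self.rnn = nn.GRU(audio_dim, hidden_dim, num_layers=2,
                              batch_first=True)
            self.to_delta = nn.Linear(hidden_dim, latent_dim)

        def forward(self, audio_feats, w_identity):
            # audio_feats: (batch, T, audio_dim)
            # w_identity:  (batch, latent_dim), obtained by inverting the
            #              identity image with a pre-trained encoder.
            h, _ = self.rnn(audio_feats)               # (batch, T, hidden_dim)
            delta_w = self.to_delta(h)                 # (batch, T, latent_dim)
            # Each frame's latent code is the identity code plus a
            # speech-driven displacement, w_t = w_id + delta_w_t, so the
            # sequence traces a trajectory through the latent space.
            return w_identity.unsqueeze(1) + delta_w   # (batch, T, latent_dim)

    # Usage sketch (hypothetical encoder/generator calls):
    #   w_identity = encoder(identity_image)
    #   w_traj = AudioToLatentRNN()(audio_feats, w_identity)
    #   frames = stylegan_generator(w_traj.flatten(0, 1))  # decode each frame
    # Stage two (not shown) fine-tunes the generator's weights on a single
    # image or short clip of the target identity to improve visual quality,
    # in the spirit of pivotal tuning.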