GeneFace++: Generalized and Stable Real-Time Audio-Driven 3D Talking Face Generation

Generating talking portraits of a person driven by arbitrary speech audio is a crucial problem for digital humans and the metaverse. A modern talking face generation method is expected to achieve generalized audio-lip synchronization, good video quality, and high system efficiency. Recently, neural radiance fields (NeRF) have become a popular rendering technique in this field, since they achieve high-fidelity and 3D-consistent talking face generation from only a few minutes of training video. However, several challenges remain for NeRF-based methods: 1) for lip synchronization, it is hard to generate long facial motion sequences with high temporal consistency and audio-lip accuracy; 2) for video quality, the limited data used to train the renderer makes it vulnerable to out-of-domain input conditions, occasionally producing poor rendering results; 3) for system efficiency, the slow training and inference of vanilla NeRF severely obstruct its use in real-world applications. In this paper, we propose GeneFace++, which handles these challenges by 1) utilizing the pitch contour as an auxiliary feature and introducing a temporal loss in the facial motion prediction process; 2) proposing a landmark locally linear embedding method that regulates outliers in the predicted motion sequence to avoid robustness issues; 3) designing a computationally efficient NeRF-based motion-to-video renderer that achieves fast training and real-time inference. With these designs, GeneFace++ becomes the first NeRF-based method to achieve stable, real-time talking face generation with generalized audio-lip synchronization. Extensive experiments show that our method outperforms state-of-the-art baselines in both subjective and objective evaluation. Video samples are available at https://genefaceplusplus.github.io .
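The abstract does not give the exact form of the temporal loss. A common instantiation is a first-order smoothness penalty on the predicted motion sequence, sketched below as a minimal PyTorch example; the function name and tensor shape conventions are our assumptions, not the authors' code.

    import torch

    def temporal_loss(pred_motion):
        # pred_motion: (T, D) tensor of per-frame facial motion
        # representations (e.g., flattened 3D landmarks).
        velocity = pred_motion[1:] - pred_motion[:-1]  # first-order difference
        return velocity.pow(2).mean()                  # penalize frame-to-frame jitter

Such a term would be added to the main prediction loss so that consecutive frames move smoothly, addressing the temporal-consistency part of challenge 1).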
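The landmark locally linear embedding step can be read as projecting each predicted landmark frame onto the manifold of landmarks observed in the training video, in the spirit of classic locally linear embedding. Below is a minimal NumPy sketch of that projection, assuming a flattened (N, D) bank of ground-truth landmark frames; all names are illustrative assumptions, not the authors' implementation.

    import numpy as np

    def lle_project(pred, database, k=10):
        # pred: (D,) flattened landmark vector for one predicted frame
        # database: (N, D) bank of ground-truth landmark frames
        dists = np.linalg.norm(database - pred, axis=1)   # distance to every real frame
        idx = np.argsort(dists)[:k]                       # k nearest neighbors
        neighbors = database[idx]                         # (k, D)
        diff = neighbors - pred                           # (k, D)
        G = diff @ diff.T                                 # local Gram matrix (k, k)
        G += np.eye(k) * (1e-3 * np.trace(G) + 1e-8)      # regularize for stability
        w = np.linalg.solve(G, np.ones(k))                # solve G w = 1 (LLE weights)
        w /= w.sum()                                      # enforce sum-to-one constraint
        return w @ neighbors                              # projection onto local manifold

At inference time, an out-of-domain prediction would be replaced (or blended) with its projection, so the renderer only ever sees motions close to those it was trained on, which is how the method avoids the robustness issues described in challenge 2).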
