Audio2Head: Audio-driven One-shot Talking-head Generation with Natural Head Motion

We propose an audio-driven talking-head method that generates photo-realistic talking-head videos from a single reference image. In this work, we tackle two key challenges: (i) producing natural head motions that match the speech prosody, and (ii) maintaining the speaker's appearance under large head motions while keeping the non-face regions stable. We first design a head pose predictor that models rigid 6D head movements with a motion-aware recurrent neural network (RNN). The predicted head poses thus serve as the low-frequency, holistic movements of a talking head, allowing our subsequent networks to focus on generating detailed facial movements. To depict the full-image motion arising from audio, we exploit a keypoint-based dense motion field representation. We then develop a motion field generator that produces dense motion fields from the input audio, the predicted head poses, and the reference image. Because this keypoint-based representation models the motions of the facial regions, head, and background in a unified manner, our method better preserves the spatial and temporal consistency of the generated videos. Finally, an image generation network renders photo-realistic talking-head videos from the estimated keypoint-based motion fields and the reference image. Extensive experiments demonstrate that our method produces videos with plausible head motions, synchronized facial expressions, and stable backgrounds, and that it outperforms the state of the art.
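To make the three-stage pipeline concrete, the following is a minimal PyTorch sketch of how the stages might compose: an RNN head pose predictor, a keypoint-based motion field generator, and a warping-based renderer. All module names, feature dimensions, and the Gaussian-interpolation warp used here are illustrative assumptions for exposition, not the authors' actual implementation.

```python
# A minimal, illustrative PyTorch sketch of the three-stage pipeline described
# above. Module names, feature sizes, and the Gaussian-interpolation warp are
# assumptions made for exposition, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HeadPosePredictor(nn.Module):
    """RNN mapping per-frame audio features to rigid 6D head poses
    (3 rotation + 3 translation parameters per frame)."""

    def __init__(self, audio_dim=80, hidden_dim=256):
        super().__init__()
        self.rnn = nn.LSTM(audio_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 6)

    def forward(self, audio):                      # audio: (B, T, audio_dim)
        h, _ = self.rnn(audio)
        return self.head(h)                        # poses: (B, T, 6)


class MotionFieldGenerator(nn.Module):
    """Predicts K 2D keypoints per frame from audio, head poses, and a
    reference image; the keypoints parameterize a dense motion field."""

    def __init__(self, audio_dim=80, num_kp=10):
        super().__init__()
        self.img_enc = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.fuse = nn.LSTM(audio_dim + 6 + 64, 256, batch_first=True)
        self.kp_head = nn.Linear(256, num_kp * 2)
        self.num_kp = num_kp

    def forward(self, audio, poses, ref_img):
        B, T, _ = audio.shape
        code = self.img_enc(ref_img).unsqueeze(1).expand(B, T, -1)
        h, _ = self.fuse(torch.cat([audio, poses, code], dim=-1))
        # Keypoints in normalized [-1, 1] image coordinates.
        return torch.tanh(self.kp_head(h)).view(B, T, self.num_kp, 2)


def dense_flow_from_keypoints(kp_src, kp_drv, size, sigma=0.1):
    """Expand sparse keypoint displacements into a dense sampling grid via
    Gaussian-weighted interpolation (a simplification of keypoint-based
    dense motion models)."""
    B, K, _ = kp_src.shape
    H, W = size
    gy, gx = torch.meshgrid(torch.linspace(-1, 1, H, device=kp_src.device),
                            torch.linspace(-1, 1, W, device=kp_src.device),
                            indexing="ij")
    grid = torch.stack([gx, gy], dim=-1)                      # (H, W, 2)
    d2 = ((grid.view(1, 1, H, W, 2)
           - kp_drv.view(B, K, 1, 1, 2)) ** 2).sum(-1)        # (B, K, H, W)
    w = torch.softmax(-d2 / (2 * sigma ** 2), dim=1)
    disp = (kp_src - kp_drv).view(B, K, 1, 1, 2)              # back-warp flow
    return grid.unsqueeze(0) + (w.unsqueeze(-1) * disp).sum(1)  # (B, H, W, 2)


def generate(pose_net, motion_net, audio, ref_img):
    """Full pipeline: audio -> head poses -> keypoints -> warped frames.
    Treating the first frame's keypoints as the source is an assumption."""
    poses = pose_net(audio)
    kps = motion_net(audio, poses, ref_img)                   # (B, T, K, 2)
    frames = []
    for t in range(kps.shape[1]):
        grid = dense_flow_from_keypoints(kps[:, 0], kps[:, t],
                                         ref_img.shape[-2:])
        frames.append(F.grid_sample(ref_img, grid, align_corners=True))
    return torch.stack(frames, dim=1)                         # (B, T, 3, H, W)
```

Feeding, for example, `audio = torch.randn(1, 50, 80)` and `ref_img = torch.randn(1, 3, 64, 64)` through `generate(HeadPosePredictor(), MotionFieldGenerator(), audio, ref_img)` yields a `(1, 50, 3, 64, 64)` frame tensor. A real renderer would replace the raw warp with a generator network that inpaints occluded regions, which is the role the paper assigns to its image generation network.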
