FaceFormer: Speech-Driven 3D Facial Animation with Transformers

Speech-driven 3D facial animation is challenging due to the complex geometry of human faces and the limited availability of 3D audio-visual data. Prior works typically focus on learning phoneme-level features from short audio windows with limited context, occasionally resulting in inaccurate lip movements. To tackle this limitation, we propose a Transformer-based autoregressive model, FaceFormer, which encodes the long-term audio context and autoregressively predicts a sequence of animated 3D face meshes. To cope with data scarcity, we integrate self-supervised pre-trained speech representations. We also devise two biased attention mechanisms well suited to this task: biased cross-modal multi-head (MH) attention and biased causal MH self-attention with a periodic positional encoding strategy. The former effectively aligns the audio and motion modalities, whereas the latter improves generalization to longer audio sequences. Extensive experiments and a perceptual user study show that our approach outperforms existing state-of-the-art methods. We encourage watching the supplementary video. The code will be made available.
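The biased attention mechanisms and the periodic positional encoding are only named in the abstract; the sketch below illustrates one plausible reading of them in PyTorch. Everything here (the function names, the period of 25 frames, the single-head simplification, the linear ALiBi-style slope) is an illustrative assumption, not the paper's released implementation.

import math
import torch

def periodic_positional_encoding(seq_len, d_model, period=25):
    # Sinusoidal encoding whose position index wraps every `period` frames,
    # so the same pattern repeats over long sequences (period is an assumed value).
    position = (torch.arange(seq_len).unsqueeze(1) % period).float()   # (T, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe                                                          # (T, d_model)

def biased_causal_attention(q, k, v, slope=1.0):
    # Single-head causal attention whose scores receive a linear distance bias,
    # so recent frames are favored over distant past frames.
    T, d = q.shape[-2], q.shape[-1]
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)                    # (T, T)
    i = torch.arange(T).unsqueeze(1)                                   # query (frame) index
    j = torch.arange(T).unsqueeze(0)                                   # key (frame) index
    bias = slope * (j - i).clamp(max=0).float()                        # 0 at j == i, more negative further back
    scores = scores + bias
    scores = scores.masked_fill(j > i, float("-inf"))                  # causal mask: no future frames
    return torch.softmax(scores, dim=-1) @ v

# Illustrative usage with made-up sizes: 100 motion frames, 64-d features.
T, d = 100, 64
x = torch.randn(T, d) + periodic_positional_encoding(T, d)
out = biased_causal_attention(x, x, x, slope=0.5)                      # (T, d)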
