FaceFormer: Speech-Driven 3D Facial Animation with Transformers

Speech-driven 3D facial animation is challenging due to the complex geometry of human faces and the limited availability of 3D audio-visual data. Prior works typically focus on learning phoneme-level features from short audio windows with limited context, occasionally resulting in inaccurate lip movements. To tackle this limitation, we propose a Transformer-based autoregressive model, FaceFormer, which encodes the long-term audio context and autoregressively predicts a sequence of animated 3D face meshes. To cope with data scarcity, we integrate self-supervised pre-trained speech representations. We also devise two biased attention mechanisms well suited to this task: biased cross-modal multi-head (MH) attention and biased causal MH self-attention with a periodic positional encoding strategy. The former effectively aligns the audio and motion modalities, whereas the latter improves generalization to longer audio sequences. Extensive experiments and a perceptual user study show that our approach outperforms existing state-of-the-art methods. We encourage watching the supplementary video. The code will be made available.
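The biased attention mechanisms and the periodic positional encoding are only named in the abstract; the sketch below illustrates one plausible reading of them in PyTorch. Everything here (the function names, the period of 25 frames, the single-head simplification, the linear ALiBi-style slope) is an illustrative assumption, not the paper's released implementation.

import math
import torch

def periodic_positional_encoding(seq_len, d_model, period=25):
    # Sinusoidal encoding whose position index wraps every `period` frames,
    # so the same pattern repeats over long sequences (period is an assumed value).
    position = (torch.arange(seq_len).unsqueeze(1) % period).float()   # (T, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe                                                          # (T, d_model)

def biased_causal_attention(q, k, v, slope=1.0):
    # Single-head causal attention whose scores receive a linear distance bias,
    # so recent frames are favored over distant past frames.
    T, d = q.shape[-2], q.shape[-1]
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)                    # (T, T)
    i = torch.arange(T).unsqueeze(1)                                   # query (frame) index
    j = torch.arange(T).unsqueeze(0)                                   # key (frame) index
    bias = slope * (j - i).clamp(max=0).float()                        # 0 at j == i, more negative further back
    scores = scores + bias
    scores = scores.masked_fill(j > i, float("-inf"))                  # causal mask: no future frames
    return torch.softmax(scores, dim=-1) @ v

# Illustrative usage with made-up sizes: 100 motion frames, 64-d features.
T, d = 100, 64
x = torch.randn(T, d) + periodic_positional_encoding(T, d)
out = biased_causal_attention(x, x, x, slope=0.5)                      # (T, d)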
