MeshTalk: 3D Face Animation from Speech using Cross-Modality Disentanglement

This paper presents a generic method for generating full facial 3D animation from speech. Existing approaches to audio-driven facial animation exhibit uncanny or static upper-face animation, fail to produce accurate and plausible co-articulation, or rely on person-specific models that limit their scalability. To improve upon existing models, we propose a generic audio-driven facial animation approach that achieves highly realistic motion synthesis results for the entire face. At the core of our approach is a categorical latent space for facial animation that disentangles audio-correlated and audio-uncorrelated information based on a novel cross-modality loss. Our approach ensures highly accurate lip motion, while also synthesizing plausible animation of the parts of the face that are uncorrelated to the audio signal, such as eye blinks and eyebrow motion. We demonstrate that our approach outperforms several baselines and obtains state-of-the-art quality both qualitatively and quantitatively. A perceptual user study demonstrates that our approach is deemed more realistic than the current state-of-the-art in over 75% of cases. We recommend watching the supplemental video before reading the paper: https://github.com/facebookresearch/meshtalk
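As a rough illustration of the categorical latent space and cross-modality loss described above, the sketch below shows one plausible PyTorch formulation. The module sizes, tensor shapes, vertex count, mouth-region mask, and loss weighting are assumptions made for this example and are not the authors' implementation; the Gumbel-Softmax relaxation is used here only as a generic way to obtain differentiable categorical latents.

```python
# Minimal, illustrative sketch of a categorical latent space with a
# cross-modality loss, loosely following the idea described in the abstract.
# All sizes, shapes, and the vertex-region split are assumptions for
# illustration only.

import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_VERTS = 6172      # assumed mesh resolution
NUM_HEADS = 16        # assumed number of categorical latent heads
NUM_CLASSES = 128     # assumed categories per head


class CategoricalFusion(nn.Module):
    """Fuses audio and expression features into categorical latents and
    decodes them back to per-frame mesh vertices."""

    def __init__(self, audio_dim=80, expr_dim=NUM_VERTS * 3):
        super().__init__()
        self.audio_enc = nn.Linear(audio_dim, 256)
        self.expr_enc = nn.Linear(expr_dim, 256)
        self.to_logits = nn.Linear(512, NUM_HEADS * NUM_CLASSES)
        self.decoder = nn.Linear(NUM_HEADS * NUM_CLASSES, NUM_VERTS * 3)

    def forward(self, audio_feat, expr_verts, tau=1.0):
        # audio_feat: (B, T, audio_dim); expr_verts: (B, T, NUM_VERTS * 3)
        h = torch.cat([self.audio_enc(audio_feat),
                       self.expr_enc(expr_verts)], dim=-1)
        logits = self.to_logits(h).view(*h.shape[:2], NUM_HEADS, NUM_CLASSES)
        # Gumbel-Softmax gives differentiable samples from the categorical latent.
        z = F.gumbel_softmax(logits, tau=tau, hard=True, dim=-1)
        verts = self.decoder(z.flatten(start_dim=2))
        return verts.view(*h.shape[:2], NUM_VERTS, 3)


def cross_modality_loss(model, audio_i, verts_i, verts_j, mouth_mask):
    """One plausible cross-modality loss: decode with audio from sequence i
    and expression input from sequence j, then supervise the mouth region
    with i's geometry and the rest of the face with j's geometry, so the
    latent space is pushed to separate audio-correlated from
    audio-uncorrelated motion."""
    pred = model(audio_i, verts_j.flatten(start_dim=2))
    mouth = mouth_mask.view(1, 1, -1, 1)                # (1, 1, NUM_VERTS, 1)
    loss_mouth = ((pred - verts_i) ** 2 * mouth).mean()
    loss_upper = ((pred - verts_j) ** 2 * (1 - mouth)).mean()
    return loss_mouth + loss_upper
```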
