Joint Audio-Text Model for Expressive Speech-Driven 3D Facial Animation

Speech-driven 3D facial animation with accurate lip synchronization has been widely studied. However, synthesizing realistic motions for the entire face during speech has rarely been explored. In this work, we present a joint audio-text model to capture contextual information for expressive speech-driven 3D facial animation. Existing datasets are collected to cover as many different phonemes as possible rather than diverse sentences, which limits the ability of audio-based models to learn varied contexts. To address this, we propose to leverage contextual text embeddings extracted from a powerful pre-trained language model that has learned rich contextual representations from large-scale text data. Our hypothesis is that these text features can disambiguate the variations in upper-face expressions, which are not strongly correlated with the audio. In contrast to prior approaches that learn phoneme-level features from text, we investigate high-level contextual text features for speech-driven 3D facial animation. We show that the combined acoustic and textual modalities can synthesize realistic facial expressions while maintaining audio-lip synchronization. We conduct quantitative and qualitative evaluations as well as a perceptual user study. The results demonstrate the superior performance of our model over existing state-of-the-art approaches.
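
To make the joint audio-text idea concrete, the sketch below shows one plausible way to pair contextual text embeddings from a pre-trained language model (e.g., BERT) with frame-level acoustic features before feeding them to an animation decoder. This is a minimal illustration, not the authors' released code: the model names, feature choices (MFCCs), alignment scheme, and dimensions are all assumptions.

```python
# Minimal sketch of joint audio-text feature extraction for speech-driven
# facial animation. Assumptions: bert-base-uncased as the language model,
# MFCCs as the acoustic features, and naive linear resampling for alignment.
import torch
import librosa
from transformers import AutoTokenizer, AutoModel


def extract_joint_features(wav_path: str, transcript: str) -> torch.Tensor:
    # Frame-level acoustic features (MFCCs here; a learned speech encoder
    # could be used instead).
    audio, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)      # (13, T_audio)
    audio_feat = torch.from_numpy(mfcc).float().T                # (T_audio, 13)

    # Contextual text embeddings from a pre-trained language model.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    bert = AutoModel.from_pretrained("bert-base-uncased")
    with torch.no_grad():
        tokens = tokenizer(transcript, return_tensors="pt")
        text_feat = bert(**tokens).last_hidden_state[0]          # (T_text, 768)

    # Naive temporal alignment: resample token embeddings to the number of
    # audio frames before concatenation (a real system might use forced
    # alignment or attention-based cross-modal fusion instead).
    text_resampled = torch.nn.functional.interpolate(
        text_feat.T.unsqueeze(0),                                # (1, 768, T_text)
        size=audio_feat.shape[0],
        mode="linear",
        align_corners=False,
    ).squeeze(0).T                                               # (T_audio, 768)

    # Concatenated joint representation for a downstream animation decoder.
    return torch.cat([audio_feat, text_resampled], dim=-1)       # (T_audio, 781)
```

The concatenated features would then be consumed by a sequence model that regresses per-frame 3D vertex offsets or blendshape weights; how that decoder is built is left open here.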
