Learning Speech-driven 3D Conversational Gestures from Video

We propose the first approach to synthesize synchronous 3D conversational body and hand gestures, as well as 3D face and head animations, of a virtual character from speech input. Our algorithm uses a CNN architecture that leverages the inherent correlation between facial expression and hand gestures. Synthesizing conversational body gestures is a multi-modal problem, since many different gestures can plausibly accompany the same input speech. To synthesize plausible body gestures in this setting, we train a Generative Adversarial Network (GAN) based model that measures the plausibility of the generated sequences of 3D body motion when paired with the input audio features. We also contribute a new corpus containing more than 33 hours of annotated data from in-the-wild videos of talking people. To build it, we apply state-of-the-art monocular approaches for 3D body and hand pose estimation, as well as 3D face performance capture, to the video corpus. In this way, we can train on orders of magnitude more data than previous algorithms, which resort to complex in-studio motion capture solutions, and thereby learn more expressive synthesis models. Our experiments and user study demonstrate the state-of-the-art quality of our speech-synthesized full 3D character animations.
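The abstract describes the adversarial setup only at a high level. As a rough illustration of what an audio-conditioned GAN of this kind can look like, the following is a minimal PyTorch sketch. Everything in it, including the module designs, the feature and joint dimensions, and the added L2 regression term, is our own assumption for illustration and not the paper's actual architecture: a temporal 1D CNN maps per-frame audio features to 3D pose sequences, and a discriminator scores the plausibility of a pose sequence paired with the same audio features.

```python
# Hypothetical sketch of a speech-to-gesture GAN; not the paper's exact architecture.
# Assumed shapes: audio features (B, T, AUDIO_DIM), body poses (B, T, POSE_DIM).
import torch
import torch.nn as nn

AUDIO_DIM, POSE_DIM = 29, 63  # e.g. per-frame speech features, 21 joints x 3 (illustrative only)

class Generator(nn.Module):
    """Temporal 1D CNN mapping audio features to 3D joint positions per frame."""
    def __init__(self, audio_dim=AUDIO_DIM, pose_dim=POSE_DIM, width=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(audio_dim, width, kernel_size=5, padding=2),
            nn.LeakyReLU(0.2),
            nn.Conv1d(width, width, kernel_size=5, padding=2),
            nn.LeakyReLU(0.2),
            nn.Conv1d(width, pose_dim, kernel_size=5, padding=2),
        )

    def forward(self, audio):               # audio: (B, T, audio_dim)
        x = audio.transpose(1, 2)           # -> (B, audio_dim, T) for Conv1d
        return self.net(x).transpose(1, 2)  # -> (B, T, pose_dim)

class Discriminator(nn.Module):
    """Scores the plausibility of a pose sequence paired with its audio features."""
    def __init__(self, audio_dim=AUDIO_DIM, pose_dim=POSE_DIM, width=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(audio_dim + pose_dim, width, kernel_size=5, padding=2, stride=2),
            nn.LeakyReLU(0.2),
            nn.Conv1d(width, width, kernel_size=5, padding=2, stride=2),
            nn.LeakyReLU(0.2),
        )
        self.head = nn.Linear(width, 1)

    def forward(self, audio, poses):        # (B, T, audio_dim), (B, T, pose_dim)
        x = torch.cat([audio, poses], dim=-1).transpose(1, 2)
        h = self.net(x).mean(dim=-1)        # temporal average pooling -> (B, width)
        return self.head(h)                 # real/fake logit per sequence

# One adversarial step on a dummy batch (non-saturating GAN loss; the L2 regression
# term on the generator is a common stabilizing choice we assume here):
G, D = Generator(), Discriminator()
bce = nn.BCEWithLogitsLoss()
audio = torch.randn(4, 64, AUDIO_DIM)       # 4 clips, 64 frames each
real = torch.randn(4, 64, POSE_DIM)
fake = G(audio)
d_loss = bce(D(audio, real), torch.ones(4, 1)) + bce(D(audio, fake.detach()), torch.zeros(4, 1))
g_loss = bce(D(audio, fake), torch.ones(4, 1)) + (fake - real).pow(2).mean()
```

Conditioning the discriminator on the audio features, rather than on the motion alone, is what lets it judge whether a gesture sequence is plausible for this particular speech input, which is the role the abstract assigns to the GAN-based model.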
