Speech gesture generation from the trimodal context of text, audio, and speaker identity

For human-like agents, including virtual avatars and social robots, making proper gestures while speaking is crucial in human-agent interaction. Co-speech gestures enhance the interaction experience and make the agents look alive. However, it is difficult to generate human-like gestures due to the lack of understanding of how people gesture. Data-driven approaches attempt to learn gesticulation skills from human demonstrations, but the ambiguous and individual nature of gestures hinders learning. In this paper, we present an automatic gesture generation model that uses the multimodal context of speech text, audio, and speaker identity to reliably generate gestures. By incorporating a multimodal context and an adversarial training scheme, the proposed model outputs gestures that are human-like and that match the speech content and rhythm. We also introduce a new quantitative evaluation metric for gesture generation models. Experiments with the introduced metric and subjective human evaluation showed that the proposed gesture generation model is better than existing end-to-end generation models. We further confirm that our model is able to work with synthesized audio in a scenario where contexts are constrained, and show that different gesture styles can be generated for the same speech by specifying different speaker identities in the style embedding space that is learned from videos of various speakers. All code and data are available at https://github.com/ai4r/Gesture-Generation-from-Trimodal-Context.
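To make the trimodal idea concrete, below is a minimal PyTorch sketch of a generator that consumes text, audio, and speaker identity and emits a pose sequence, together with a discriminator for the adversarial term. All module names, layer sizes, and feature dimensions here are hypothetical placeholders, not the authors' architecture; see the repository linked above for the actual model and training code.

```python
# Illustrative sketch only: a trimodal (text + audio + speaker identity) gesture
# generator and an adversarial pose discriminator. Dimensions are placeholders.
import torch
import torch.nn as nn


class TrimodalGestureGenerator(nn.Module):
    """Maps (text, audio, speaker identity) to a sequence of upper-body poses."""

    def __init__(self, vocab_size=20000, n_speakers=100,
                 text_dim=300, audio_dim=128, style_dim=16, pose_dim=27, hidden=256):
        super().__init__()
        # Text branch: word embeddings, assumed pre-aligned to output frames.
        self.word_emb = nn.Embedding(vocab_size, text_dim)
        # Audio branch: frame-level acoustic features projected to a fixed size.
        self.audio_proj = nn.Linear(audio_dim, audio_dim)
        # Speaker-identity branch: a learned style-embedding space; choosing
        # different vectors yields different gesture styles for the same speech.
        self.style_emb = nn.Embedding(n_speakers, style_dim)
        # Sequence decoder over the concatenated trimodal context.
        self.gru = nn.GRU(text_dim + audio_dim + style_dim, hidden,
                          num_layers=2, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, pose_dim)

    def forward(self, word_ids, audio_feats, speaker_ids):
        # word_ids: (B, T), audio_feats: (B, T, audio_dim), speaker_ids: (B,)
        T = word_ids.size(1)
        text = self.word_emb(word_ids)                           # (B, T, text_dim)
        audio = self.audio_proj(audio_feats)                     # (B, T, audio_dim)
        style = self.style_emb(speaker_ids).unsqueeze(1).expand(-1, T, -1)
        ctx = torch.cat([text, audio, style], dim=-1)            # (B, T, ctx_dim)
        h, _ = self.gru(ctx)
        return self.out(h)                                       # (B, T, pose_dim)


class PoseDiscriminator(nn.Module):
    """Scores whether a pose sequence looks like real human motion."""

    def __init__(self, pose_dim=27, hidden=128):
        super().__init__()
        self.gru = nn.GRU(pose_dim, hidden, batch_first=True, bidirectional=True)
        self.cls = nn.Linear(2 * hidden, 1)

    def forward(self, poses):
        h, _ = self.gru(poses)
        return torch.sigmoid(self.cls(h[:, -1]))                 # (B, 1) real/fake score
```

The abstract does not spell out the form of the new quantitative metric; the sketch below assumes a Fréchet-style distance between feature distributions of real and generated gestures (analogous to FID for images), with the per-sequence feature extractor left unspecified.

```python
# Hedged sketch of a Fréchet-style distance between real and generated gesture
# features; how the (N, D) feature arrays are produced is assumed, not shown.
import numpy as np
from scipy import linalg


def frechet_distance(real_feats, gen_feats):
    """real_feats, gen_feats: (N, D) arrays of per-sequence gesture features."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real              # discard numerical imaginary residue
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```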
