Speech-Driven Conversational Agents using Conditional Flow-VAEs

Automatic control of conversational agents has applications ranging from animation, through human-computer interaction, to robotics. In interactive communication, an agent must move to express its own discourse and also react naturally to incoming speech. In this paper we propose a Flow Variational Autoencoder (Flow-VAE) deep learning architecture for transforming conversational speech into body gesture during both speaking and listening. The model uses a normalising flow to perform variational inference in an autoencoder framework, yielding a more expressive approximate posterior than the Gaussian used in conventional variational autoencoders. Our model is non-deterministic, so it can produce different plausible gestures for the same speech. Our evaluation demonstrates that our approach produces expressive body motion close to the ground truth while using a fraction of the trainable parameters of previous state-of-the-art models.
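
To make the core idea concrete, below is a minimal sketch of a speech-conditioned Flow-VAE, not the paper's implementation: the flow type (planar flows), the layer sizes, and the feature dimensions (`speech_dim`, `pose_dim`, `latent_dim`) are hypothetical choices for illustration only. It shows the two ingredients the abstract names: a normalising flow that transforms the encoder's Gaussian posterior into a more expressive distribution, and a decoder conditioned on speech so that sampling the latent produces varied gestures for the same audio.

```python
# Illustrative sketch of a conditional Flow-VAE for speech-to-gesture mapping.
# All dimensions and the planar-flow choice are assumptions, not the paper's design.
import torch
import torch.nn as nn


class PlanarFlow(nn.Module):
    """One planar flow step: z' = z + u * tanh(w.z + b).
    The invertibility constraint on u is omitted for brevity."""

    def __init__(self, dim):
        super().__init__()
        self.w = nn.Parameter(torch.randn(dim) * 0.01)
        self.u = nn.Parameter(torch.randn(dim) * 0.01)
        self.b = nn.Parameter(torch.zeros(1))

    def forward(self, z):
        lin = z @ self.w + self.b                              # (batch,)
        z_new = z + self.u * torch.tanh(lin).unsqueeze(-1)
        # log|det J| = log|1 + u . psi|, with psi = (1 - tanh^2(lin)) * w
        psi = (1.0 - torch.tanh(lin) ** 2).unsqueeze(-1) * self.w
        log_det = torch.log(torch.abs(1.0 + psi @ self.u) + 1e-8)
        return z_new, log_det


class ConditionalFlowVAE(nn.Module):
    """Encoder -> Gaussian base posterior -> flow refinement -> speech-conditioned decoder."""

    def __init__(self, speech_dim=26, pose_dim=45, latent_dim=32, n_flows=4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(speech_dim + pose_dim, 128), nn.ReLU(),
            nn.Linear(128, 2 * latent_dim),                    # outputs [mu, log_var]
        )
        self.flows = nn.ModuleList(PlanarFlow(latent_dim) for _ in range(n_flows))
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + speech_dim, 128), nn.ReLU(),
            nn.Linear(128, pose_dim),                          # one pose frame
        )

    def forward(self, speech, pose):
        mu, log_var = self.encoder(torch.cat([speech, pose], dim=-1)).chunk(2, dim=-1)
        std = torch.exp(0.5 * log_var)
        z = mu + std * torch.randn_like(std)                   # reparameterised sample z_0
        log_q0 = torch.distributions.Normal(mu, std).log_prob(z).sum(dim=-1)
        sum_log_det = torch.zeros(z.shape[0])
        for flow in self.flows:                                # push z_0 through the flow -> z_K
            z, log_det = flow(z)
            sum_log_det = sum_log_det + log_det
        recon = self.decoder(torch.cat([z, speech], dim=-1))
        # Single-sample negative ELBO with the flow's log-det correction:
        # -log p(pose | z_K, speech) + log q0(z_0) - log p(z_K) - sum log|det J|
        recon_nll = ((recon - pose) ** 2).sum(dim=-1)          # Gaussian decoder, up to a constant
        log_prior = torch.distributions.Normal(0.0, 1.0).log_prob(z).sum(dim=-1)
        loss = (recon_nll + log_q0 - log_prior - sum_log_det).mean()
        return recon, loss


# Hypothetical usage: a batch of 8 frames with 26-d speech features and 45-d poses.
speech, pose = torch.randn(8, 26), torch.randn(8, 45)
model = ConditionalFlowVAE()
recon, loss = model(speech, pose)
loss.backward()
```

Because the flow only reshapes the latent distribution, the non-determinism comes for free: sampling different z_0 for the same speech conditioning yields different but plausible gesture outputs.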
