Speech-Driven Conversational Agents using Conditional Flow-VAEs

Automatic control of conversational agents has applications ranging from animation, through human-computer interaction, to robotics. In interactive communication, an agent must move to express its own discourse and also react naturally to incoming speech. In this paper we propose a Flow Variational Autoencoder (Flow-VAE) deep learning architecture for transforming conversational speech into body gesture during both speaking and listening. The model uses a normalising flow to perform variational inference in an autoencoder framework, yielding a more expressive approximate posterior than the Gaussian used in conventional variational autoencoders. Our model is non-deterministic, so it can produce different plausible gestures for the same speech. Our evaluation demonstrates that our approach produces expressive body motion close to the ground truth while using a fraction of the trainable parameters of previous state-of-the-art models.
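
To make the core idea concrete, below is a minimal sketch of a speech-conditioned Flow-VAE, not the paper's implementation: the flow type (planar flows), the layer sizes, and the feature dimensions (`speech_dim`, `pose_dim`, `latent_dim`) are hypothetical choices for illustration only. It shows the two ingredients the abstract names: a normalising flow that transforms the encoder's Gaussian posterior into a more expressive distribution, and a decoder conditioned on speech so that sampling the latent produces varied gestures for the same audio.

```python
# Illustrative sketch of a conditional Flow-VAE for speech-to-gesture mapping.
# All dimensions and the planar-flow choice are assumptions, not the paper's design.
import torch
import torch.nn as nn


class PlanarFlow(nn.Module):
    """One planar flow step: z' = z + u * tanh(w.z + b).
    The invertibility constraint on u is omitted for brevity."""

    def __init__(self, dim):
        super().__init__()
        self.w = nn.Parameter(torch.randn(dim) * 0.01)
        self.u = nn.Parameter(torch.randn(dim) * 0.01)
        self.b = nn.Parameter(torch.zeros(1))

    def forward(self, z):
        lin = z @ self.w + self.b                              # (batch,)
        z_new = z + self.u * torch.tanh(lin).unsqueeze(-1)
        # log|det J| = log|1 + u . psi|, with psi = (1 - tanh^2(lin)) * w
        psi = (1.0 - torch.tanh(lin) ** 2).unsqueeze(-1) * self.w
        log_det = torch.log(torch.abs(1.0 + psi @ self.u) + 1e-8)
        return z_new, log_det


class ConditionalFlowVAE(nn.Module):
    """Encoder -> Gaussian base posterior -> flow refinement -> speech-conditioned decoder."""

    def __init__(self, speech_dim=26, pose_dim=45, latent_dim=32, n_flows=4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(speech_dim + pose_dim, 128), nn.ReLU(),
            nn.Linear(128, 2 * latent_dim),                    # outputs [mu, log_var]
        )
        self.flows = nn.ModuleList(PlanarFlow(latent_dim) for _ in range(n_flows))
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + speech_dim, 128), nn.ReLU(),
            nn.Linear(128, pose_dim),                          # one pose frame
        )

    def forward(self, speech, pose):
        mu, log_var = self.encoder(torch.cat([speech, pose], dim=-1)).chunk(2, dim=-1)
        std = torch.exp(0.5 * log_var)
        z = mu + std * torch.randn_like(std)                   # reparameterised sample z_0
        log_q0 = torch.distributions.Normal(mu, std).log_prob(z).sum(dim=-1)
        sum_log_det = torch.zeros(z.shape[0])
        for flow in self.flows:                                # push z_0 through the flow -> z_K
            z, log_det = flow(z)
            sum_log_det = sum_log_det + log_det
        recon = self.decoder(torch.cat([z, speech], dim=-1))
        # Single-sample negative ELBO with the flow's log-det correction:
        # -log p(pose | z_K, speech) + log q0(z_0) - log p(z_K) - sum log|det J|
        recon_nll = ((recon - pose) ** 2).sum(dim=-1)          # Gaussian decoder, up to a constant
        log_prior = torch.distributions.Normal(0.0, 1.0).log_prob(z).sum(dim=-1)
        loss = (recon_nll + log_q0 - log_prior - sum_log_det).mean()
        return recon, loss


# Hypothetical usage: a batch of 8 frames with 26-d speech features and 45-d poses.
speech, pose = torch.randn(8, 26), torch.randn(8, 45)
model = ConditionalFlowVAE()
recon, loss = model(speech, pose)
loss.backward()
```

Because the flow only reshapes the latent distribution, the non-determinism comes for free: sampling different z_0 for the same speech conditioning yields different but plausible gesture outputs.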
