Gesture generation with low-dimensional embeddings

There is a growing demand for embodied agents capable of engaging in face-to-face dialog using the same verbal and nonverbal behavior that people use. Our work focuses on generating coverbal hand gestures for such agents: gestures coupled to the content and timing of speech. A common approach is to use motion capture of an actor or hand-crafted animations for each utterance. An alternative machine learning approach, which saves development effort, is to learn a general gesture controller that can generate behavior for novel utterances. However, learning a direct mapping from speech to gesture motion is difficult because the model must infer the complex relation between the two time series. We present a novel machine learning approach that decomposes the overall learning problem into two mappings: from speech to a gestural annotation, and from gestural annotation to gesture motion. The combined model learns to synthesize natural gesture animation from speech audio. We assess the quality of the generated animations by comparing them with those produced by a previous approach that learns a direct mapping. Results from a human subject study show that our framework is perceived as significantly better.
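The two-stage decomposition can be illustrated with a minimal sketch: one function maps speech-derived features to a discrete gestural annotation, and a second maps that annotation to motion parameters. All names, features, and the toy thresholding rules below are illustrative assumptions, not the paper's actual models or feature set.

```python
# Hypothetical sketch of the two-stage pipeline: speech -> annotation -> motion.
# The feature names, gesture classes, and rules are illustrative only.
from dataclasses import dataclass
from typing import List


@dataclass
class SpeechFrame:
    energy: float  # assumed prosodic intensity feature in [0, 1]
    pitch: float   # assumed normalized f0 feature in [0, 1]


def speech_to_annotation(frames: List[SpeechFrame]) -> List[str]:
    """Stage 1: label each speech frame with a coarse gesture class."""
    labels = []
    for f in frames:
        if f.energy > 0.7:
            labels.append("beat")     # emphatic stroke on high intensity
        elif f.pitch > 0.5:
            labels.append("deictic")  # pointing-like gesture on pitch rise
        else:
            labels.append("rest")     # no gesture
    return labels


def annotation_to_motion(labels: List[str]) -> List[float]:
    """Stage 2: map gesture labels to a 1-D motion amplitude curve."""
    amplitude = {"rest": 0.0, "deictic": 0.4, "beat": 1.0}
    return [amplitude[label] for label in labels]


def generate_gesture(frames: List[SpeechFrame]) -> List[float]:
    """Chain the two learned mappings into one speech-to-motion controller."""
    return annotation_to_motion(speech_to_annotation(frames))
```

In the paper each stage is a learned model rather than hand-written rules; the point of the sketch is only that the intermediate annotation splits one hard sequence-to-sequence problem into two simpler mappings.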
