How to Train Your Avatar: A Data Driven Approach to Gesture Generation

The ability to gesture is key to realizing virtual characters that can engage in face-to-face interaction with people. Many applications predefine the possible utterances of a virtual character and hand-build all the gesture animations needed for those utterances. We can save this effort if we can construct a general gesture controller that generates behavior for novel utterances. Because the dynamics of human gesture are related to the prosody of speech, in this work we propose a model that generates gestures from prosody. We then assess the naturalness of the animations by comparing them against human gestures. The evaluation results were promising: human judgments showed no significant difference between our generated gestures and real human gestures, and the generated gestures were judged significantly better than real human gestures taken from a different utterance.
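As a rough illustration of such a prosody-driven pipeline, the sketch below extracts per-frame pitch and intensity from a speech recording and feeds them to a sequence model that predicts joint rotations. This is a minimal sketch under assumptions: the feature set, skeleton size, and model architecture (`GestureNet`, `N_JOINTS`, a GRU) are illustrative stand-ins, not the paper's actual implementation.

```python
# Hypothetical prosody-to-gesture sketch, assuming librosa for
# feature extraction and PyTorch for the sequence model. All names
# and dimensions here are illustrative, not taken from the paper.
import librosa
import numpy as np
import torch
import torch.nn as nn

N_JOINTS = 20   # assumed skeleton size (hypothetical)
HOP = 512       # analysis hop length in samples

def prosody_features(wav_path: str) -> np.ndarray:
    """Per-frame pitch (F0) and intensity (RMS) from a speech recording."""
    y, sr = librosa.load(wav_path, sr=16000)
    f0, _, _ = librosa.pyin(y,
                            fmin=librosa.note_to_hz("C2"),
                            fmax=librosa.note_to_hz("C6"),
                            hop_length=HOP)
    rms = librosa.feature.rms(y=y, hop_length=HOP)[0]
    f0 = np.nan_to_num(f0)                      # unvoiced frames -> 0
    n = min(len(f0), len(rms))
    return np.stack([f0[:n], rms[:n]], axis=1)  # shape: (frames, 2)

class GestureNet(nn.Module):
    """Toy recurrent mapping from prosody frames to joint rotations."""
    def __init__(self, in_dim: int = 2, hidden: int = 64,
                 out_dim: int = N_JOINTS * 3):
        super().__init__()
        self.rnn = nn.GRU(in_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, out_dim)

    def forward(self, prosody: torch.Tensor) -> torch.Tensor:
        # prosody: (batch, frames, 2) -> (batch, frames, joints * 3)
        h, _ = self.rnn(prosody)
        return self.out(h)
```

The GRU here merely stands in for whatever temporal model is trained on motion-capture data; the key point is the learned per-frame mapping from prosodic features to motion, which lets the controller animate novel utterances rather than only prerecorded ones.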
