Speech-driven Animation with Meaningful Behaviors

Conversational agents (CAs) play an important role in human-computer interaction. Creating believable movements for CAs is challenging, since the movements have to be meaningful and natural, reflecting the coupling between gestures and speech. Previous studies have relied mainly on rule-based or data-driven approaches. Rule-based methods focus on creating meaningful behaviors that convey the underlying message, but the gestures cannot be easily synchronized with speech. Data-driven approaches, especially speech-driven models, can capture the relationship between speech and gestures, but they generate behaviors that disregard the meaning of the message. This study bridges the gap between these two approaches, overcoming their limitations. The approach builds a dynamic Bayesian network (DBN) in which an added discrete variable constrains the generated behaviors. The study implements and evaluates the approach with two types of constraints: discourse functions and prototypical behaviors. By constraining on discourse functions (e.g., questions), the model learns from data the characteristic behaviors associated with a given discourse class. By constraining on prototypical behaviors (e.g., head nods), the approach can be embedded in a rule-based system as a behavior realizer, creating trajectories that are temporally synchronized with speech. The study proposes a DBN structure and a training approach that (1) model the cause-effect relationship between the constraint and the gestures, (2) initialize the state configuration models to increase the range of the generated behaviors, and (3) capture the differences in behaviors across constraints by enforcing sparse transitions between shared and exclusive states per constraint. Objective and subjective evaluations demonstrate the benefits of the proposed approach over an unconstrained model.
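The central modeling idea is that a single discrete constraint variable gates which hidden gesture states can be visited, so that some states are shared across constraints while others are exclusive to one constraint. The sketch below is a minimal, self-contained Python analogue of that idea, not the paper's actual DBN: the constraint selects a transition matrix whose rows place near-zero probability on states exclusive to other constraints, and each state emits a toy 2-D gesture feature from a Gaussian. All state names, constraint labels, and parameter values are invented for illustration, and the sketch omits the speech observation stream that drives the full model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hidden gesture states: two shared states plus one exclusive state per
# constraint. Sizes and labels are illustrative only; in the paper these
# are learned from data.
STATES = ["shared_0", "shared_1", "excl_question", "excl_statement"]
CONSTRAINTS = {"question": 2, "statement": 3}  # index of each constraint's exclusive state

def transition_matrix(constraint):
    """Build a constraint-conditioned transition matrix.

    Transitions into states exclusive to *other* constraints receive
    near-zero probability, mimicking the sparse shared/exclusive
    structure described in the abstract."""
    n = len(STATES)
    A = np.full((n, n), 1.0)
    own = CONSTRAINTS[constraint]
    for _, idx in CONSTRAINTS.items():
        if idx != own:
            A[:, idx] = 1e-6   # discourage exclusive states of other constraints
    A[:, own] = 3.0            # favor this constraint's exclusive state
    return A / A.sum(axis=1, keepdims=True)

# Per-state Gaussian emission models over a toy 2-D gesture feature
# (e.g., head pitch/yaw); the means are made up for illustration.
MEANS = np.array([[0.0, 0.0], [0.5, -0.2], [1.5, 0.8], [-1.0, 0.3]])
STD = 0.1

def sample_trajectory(constraint, length=20):
    """Sample a gesture feature trajectory conditioned on the constraint."""
    A = transition_matrix(constraint)
    state = 0                  # start in a shared state
    trajectory = []
    for _ in range(length):
        state = rng.choice(len(STATES), p=A[state])
        trajectory.append(rng.normal(MEANS[state], STD))
    return np.array(trajectory)

if __name__ == "__main__":
    for c in CONSTRAINTS:
        traj = sample_trajectory(c)
        print(c, "mean gesture feature:", traj.mean(axis=0).round(2))
```

Running the script shows that trajectories sampled under each constraint gravitate toward that constraint's exclusive state while still passing through the shared states, which is the qualitative behavior the shared/exclusive sparse-transition design is meant to produce.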
