Real-Time Visual Prosody for Interactive Virtual Agents

Speakers accompany their speech with incessant, subtle head movements. Implementing such “visual prosody” in virtual agents is important not only to make their behavior more natural, but also because it has been shown to help listeners understand speech. We contribute a visual prosody model for interactive virtual agents that must be capable of live, non-scripted interactions with humans and therefore have to use Text-To-Speech (TTS) rather than recorded speech. We present our method for generating visual prosody online from continuous TTS output, and we report results from three crowdsourcing experiments that examine whether, and to what extent, it enhances the interaction experience with an agent.
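To make the idea of online, prosody-driven head motion concrete, the sketch below shows one minimal way such a mapping could look; it is not the model described in the paper. It assumes per-frame prosodic features (F0 and energy, e.g. as extracted by a tool such as openSMILE) are available from the TTS audio, and all coefficients and the smoothing constant are hypothetical placeholders.

```python
import numpy as np

def head_motion_from_prosody(f0, energy, alpha=0.15):
    """Illustrative sketch: map per-frame prosody to head-rotation offsets.

    f0     : array of fundamental-frequency values in Hz (0 where unvoiced)
    energy : array of frame energies, same length as f0
    alpha  : exponential-smoothing factor (hypothetical choice)

    Returns an (N, 3) array of (pitch, yaw, roll) offsets in degrees.
    """
    f0 = np.asarray(f0, dtype=float)
    energy = np.asarray(energy, dtype=float)

    # Normalise features to roughly zero mean / unit variance per chunk.
    voiced = f0 > 0
    f0_norm = np.zeros_like(f0)
    if voiced.any():
        f0_norm[voiced] = (f0[voiced] - f0[voiced].mean()) / (f0[voiced].std() + 1e-6)
    en_norm = (energy - energy.mean()) / (energy.std() + 1e-6)

    # Hypothetical linear mapping: pitch excursions drive head pitch (nods),
    # energy drives small yaw/roll excursions.
    raw = np.stack([2.0 * f0_norm,       # head pitch (deg)
                    0.8 * en_norm,       # head yaw   (deg)
                    0.5 * en_norm], 1)   # head roll  (deg)

    # Exponential smoothing keeps the motion subtle and continuous,
    # which matters for frame-by-frame (online) generation.
    smoothed = np.zeros_like(raw)
    for t in range(1, len(raw)):
        smoothed[t] = (1 - alpha) * smoothed[t - 1] + alpha * raw[t]
    return smoothed
```

Because the computation only looks at past frames, a mapping of this kind can run incrementally on streaming TTS output; the actual feature set, mapping, and smoothing used in the paper may differ.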
