Continuous interaction with a virtual human

This paper presents our progress in developing a Virtual Human capable of being an attentive speaker. Such a Virtual Human should be able to attend to its interaction partner while it is speaking, and to modify its communicative behavior on the fly based on what it observes in its partner's behavior. We report new developments in several areas, including scheduling and interrupting multimodal behavior, automatic classification of listener responses, generation of response-eliciting behavior, and strategies for generating appropriate reactions to listener responses. Building on this progress, we implemented a task-based setup for a responsive Virtual Human and used it to carry out two user studies, whose results are presented and discussed in this paper.
