An architecture for fluid real-time conversational agents: integrating incremental output generation and input processing

Embodied conversational agents still do not achieve the fluidity and smoothness of natural conversational interaction. One main reason is that current systems often respond with high latency and in inflexible ways. We argue that to overcome these problems, real-time conversational agents need to be built on an underlying architecture that provides two essential features for fast and fluent behavior adaptation: close bi-directional coordination between input processing and output generation, and incrementality of processing at both stages. We propose an architectural framework for conversational agents, the Artificial Social Agent Platform (ASAP), that provides these two ingredients of fluid real-time conversation. We describe the overall architectural concept, along with specific means of specifying incremental behavior in BML and the technical implementation of the different modules. We show how phenomena of fluid real-time conversation, such as adapting to user feedback or smooth turn-keeping, can be realized with ASAP, and we describe in detail an example real-time interaction with the implemented system.
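
To make the BML-level incrementality concrete, the sketch below shows how an utterance could be delivered as two chunked BML blocks rather than one monolithic block, so the realizer can co-articulate the second chunk onto the first, or withhold and retract it if incremental input processing signals that the user needs an adaptation. This is a minimal sketch: the bmla namespace URI, the chunkAfter attribute, and the gesture lexeme are illustrative assumptions modeled on the BML extensions described for AsapRealizer, not a verbatim excerpt from the paper.

```xml
<!-- First increment of the utterance, sent as soon as it is planned. -->
<bml id="bml1"
     xmlns="http://www.bml-initiative.org/bml/bml-1.0"
     xmlns:bmla="http://www.asap-project.org/bmla">
  <speech id="s1"><text>So first you take the red block</text></speech>
  <!-- lexeme value is a placeholder from a hypothetical gesture lexicon -->
  <gesture id="g1" lexeme="POINT_RIGHT" start="s1:start"/>
</bml>

<!-- Second increment: chunked onto bml1 so the realizer can connect it
     fluently to the running behavior, or drop it if user feedback
     arriving in the meantime calls for a different continuation. -->
<bml id="bml2" bmla:chunkAfter="bml1"
     xmlns="http://www.bml-initiative.org/bml/bml-1.0"
     xmlns:bmla="http://www.asap-project.org/bmla">
  <speech id="s2"><text>and put it on top of the green one</text></speech>
</bml>
```

The point of chunking rather than appending is that the agent commits to only a small piece of output at a time, which is what allows the tight coupling with incremental input processing that the architecture argues for.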
