Optimizing the turn-taking behavior of task-oriented spoken dialog systems

Even as progress in speech technologies and in task and dialog modeling has allowed the development of advanced spoken dialog systems, the low-level interaction behavior of those systems often remains rigid and inefficient. Based on an analysis of human-human and human-computer turn-taking in naturally occurring task-oriented dialogs, we define a set of features that can be automatically extracted and show that they can be used to inform efficient end-of-turn detection. We then frame turn-taking as decision making under uncertainty and describe the Finite-State Turn-Taking Machine (FSTTM), a decision-theoretic model that combines data-driven machine learning methods and a cost structure derived from Conversation Analysis to control the turn-taking behavior of dialog systems. Evaluation results on CMU Let's Go, a publicly deployed bus information system, confirm that the FSTTM significantly improves the responsiveness of the system compared to both a standard threshold-based approach and previous data-driven methods.
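
To make the idea of turn-taking as decision making under uncertainty more concrete, the following is a minimal sketch of an expected-cost rule for taking the floor during a pause, set against a fixed-threshold baseline. It is not the FSTTM itself: the names `TurnCosts`, `should_take_turn`, and `threshold_baseline`, the two-cost structure, and all numeric values are assumptions made purely for illustration, and the probability that the user has finished is taken as given rather than learned from features.

```python
from dataclasses import dataclass


@dataclass
class TurnCosts:
    """Hypothetical costs, chosen for illustration only (not from the paper)."""
    cut_in: float          # cost of speaking while the user still holds the floor
    latency_per_ms: float  # cost per millisecond of silence once the user is done


def should_take_turn(p_user_done: float, pause_ms: float, costs: TurnCosts) -> bool:
    """Return True when speaking now has a lower expected cost than staying silent.

    p_user_done is an assumed estimate of the probability that the user has
    finished the turn; in a real system it would come from a trained model.
    """
    if not 0.0 <= p_user_done <= 1.0:
        raise ValueError("p_user_done must be a probability in [0, 1]")
    # Speaking is only costly if the user was in fact not finished.
    expected_cost_speak = (1.0 - p_user_done) * costs.cut_in
    # Silence is only costly if the user was in fact finished.
    expected_cost_wait = p_user_done * costs.latency_per_ms * pause_ms
    return expected_cost_speak < expected_cost_wait


def threshold_baseline(pause_ms: float, threshold_ms: float = 700.0) -> bool:
    """Fixed-threshold endpointing: respond once the pause is long enough.

    The 700 ms default is an arbitrary illustrative value.
    """
    return pause_ms >= threshold_ms


if __name__ == "__main__":
    costs = TurnCosts(cut_in=1000.0, latency_per_ms=1.0)
    # A short pause with strong evidence that the user is done.
    print(should_take_turn(p_user_done=0.9, pause_ms=200.0, costs=costs))
    print(threshold_baseline(pause_ms=200.0))
```

Under these assumed costs, the expected-cost rule can respond after a short pause when the evidence that the user has finished is strong, whereas the fixed threshold waits the same amount of time regardless of that evidence.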
