Flexible Turn-Taking for Spoken Dialogue Systems

Most research on spoken dialogue systems to date has focused either on higher levels of dialogue or on speech understanding. In contrast, low-level interactional aspects of conversation such as turn-taking have been essentially ignored, leading builders of practical systems to resort to simple methods based on pause detection to handle turn-taking. In a preliminary study based on the Let’s Go bus information system, I found that such methods lead to interaction failures, and potentially to complete dialogue breakdowns, for a significant proportion of the dialogues with real-world users. In addition, a comparison of conversational rhythm in successful dialogues showed that even when speech recognition is not a major obstacle to communication, systems behave very differently from, and much less efficiently than, human speakers. To address these issues, I propose an approach that relies on two innovations over current dialogue systems. The first is a new event-driven system architecture that allows real-time processing of conversation, which I implemented in the Olympus/RavenClaw spoken dialogue framework. In addition to the dialogue manager and the traditional understanding and generation modules, a new module, the interaction manager, is in charge of dynamically monitoring and managing low-level interaction phenomena. The second is the turn-taking model used by the interaction manager. Inspired by mobile robotics and autonomous agent research, the model is composed of a set of sensors that provide information about the world, a set of actions that the system can take, and an action selection mechanism. Although I will explore several such mechanisms, Reinforcement Learning appears to be an appropriate framework for learning turn-taking behavior. The last expected contribution of this thesis is an evaluation framework for turn-taking in spoken dialogue systems. This aspect of the proposed work will include a study of various local and global metrics of turn-taking and dialogue, along with the design and validation of composite metrics. All the theoretical findings and models proposed in this thesis will be grounded and validated in real-world applications, including Let’s Go and other RavenClaw-based dialogue systems.
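
The contrast between pause-detection-based turn-taking and the proposed sensor/action/action-selection structure can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `SensorReading` fields, the two actions, and the threshold values (700, 200, and 1200 ms) are hypothetical, and the hand-written rule in `multi_sensor_policy` merely stands in for whatever action selection mechanism (for example, one learned with Reinforcement Learning) the thesis ends up using.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    WAIT = "wait"            # keep listening to the user
    TAKE_TURN = "take_turn"  # start speaking


@dataclass(frozen=True)
class SensorReading:
    """Hypothetical sensor snapshot; both fields are illustrative."""
    pause_ms: int             # silence elapsed since the user last spoke
    utterance_complete: bool  # e.g. output of a parser or prosody classifier


def pause_threshold_policy(reading: SensorReading, threshold_ms: int = 700) -> Action:
    """Baseline: take the turn after a fixed amount of silence (assumed 700 ms)."""
    return Action.TAKE_TURN if reading.pause_ms >= threshold_ms else Action.WAIT


def multi_sensor_policy(reading: SensorReading) -> Action:
    """Toy action selection: wait less when the utterance looks complete."""
    threshold_ms = 200 if reading.utterance_complete else 1200
    return Action.TAKE_TURN if reading.pause_ms >= threshold_ms else Action.WAIT


def first_take_turn(policy, utterance_complete: bool,
                    step_ms: int = 100, limit_ms: int = 3000):
    """Return the pause length (ms) at which the policy first takes the turn."""
    for pause_ms in range(0, limit_ms + 1, step_ms):
        reading = SensorReading(pause_ms, utterance_complete)
        if policy(reading) is Action.TAKE_TURN:
            return pause_ms
    return None  # the policy never took the turn within limit_ms
```

Under these assumed thresholds, the baseline responds after the same delay regardless of what the user said, whereas the multi-sensor rule responds sooner after a complete utterance and holds back longer after an incomplete one, which is the kind of flexibility the proposed model is meant to provide.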
