User Simulation for Spoken Dialog System Development

A user simulation is a computer program that simulates human user behavior. Recently, user simulations have been widely used in two spoken dialog system development tasks: generating large simulated corpora so that machine learning can be applied to learn new dialog strategies, and replacing human users to test dialog system performance. Although previous studies have shown successful examples of applying user simulations in both tasks, it is not clear what type of user simulation is most appropriate for a specific task, because few studies compare different user simulations in the same experimental setting. In this research, we investigate how to construct user simulations for a specific task in spoken dialog system development. Since most current user simulations generate user actions based on probabilistic models, we identify two main factors in constructing such simulations: the choice of user simulation model and the approach used to set up user action probabilities. We build user simulation models that differ in how much they emphasize simulating realistic user behavior versus exploring a wider range of user actions. We also investigate different manual and trained approaches to setting up user action probabilities, and we introduce both task-dependent and task-independent measures to compare these simulations. We show that a simulated user which mimics realistic user behavior is not always necessary for the dialog strategy learning task. For the dialog system testing task, a user simulation which simulates user behavior in a statistical way can generate both objective and subjective measures of dialog system performance similar to those obtained from human users. Our research examines the strengths and weaknesses of user simulations in spoken dialog system development. Although our results are constrained by our task domain and the resources available, we provide a general framework for comparing user simulations in a task-dependent context. In addition, we summarize and validate a set of evaluation measures that can be used to compare different simulated users, as well as simulated versus human users.
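To make the core idea concrete, the probabilistic user simulations discussed above can be sketched as a model that, given the system's last action, samples a user action from a conditional probability distribution. The sketch below is illustrative only: the action names and probability values are hypothetical, and the distributions stand in for either the manually set or the corpus-trained probabilities described in the abstract.

```python
import random

class ProbabilisticUserSim:
    """Minimal sketch of a probabilistic user simulation.

    action_probs maps each system action to a distribution over
    possible user actions (hypothetical names and values).
    """

    def __init__(self, action_probs):
        self.action_probs = action_probs

    def respond(self, system_action, rng=random):
        # Sample one user action from P(user_action | system_action).
        dist = self.action_probs[system_action]
        actions, weights = zip(*dist.items())
        return rng.choices(actions, weights=weights, k=1)[0]

# Hand-set probabilities (a "manual" setup); a "trained" setup would
# instead estimate these distributions from a dialog corpus.
sim = ProbabilisticUserSim({
    "ask_slot": {"provide_value": 0.7, "silence": 0.2, "off_topic": 0.1},
    "confirm":  {"affirm": 0.8, "deny": 0.2},
})
print(sim.respond("confirm"))
```

A more realistic simulation would condition on richer dialog state (e.g., dialog history or user goals) rather than only the last system action, which is one axis along which the models compared in this work differ.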
