Evaluating spoken dialogue agents with PARADISE: Two case studies

Abstract: This paper presents PARADISE (PARAdigm for DIalogue System Evaluation), a general framework for evaluating and comparing the performance of spoken dialogue agents. The framework decouples task requirements from an agent's dialogue behaviours, supports comparisons among dialogue strategies, enables the calculation of performance over subdialogues and whole dialogues, specifies the relative contribution of various factors to performance, and makes it possible to compare agents performing different tasks by normalizing for task complexity. After presenting PARADISE, we illustrate its application to two different spoken dialogue agents. We show how to derive a performance function for each agent and how to generalize results across agents. We then show that once such a performance function has been derived, it can be used both to make predictions about future versions of an agent and as feedback to the agent, so that the agent can learn to optimize its behaviour based on its experiences with users over time.
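The abstract does not give the form of the performance function, so the following is only a minimal sketch. It assumes that performance is a weighted linear combination of a z-score-normalized task-success measure minus z-score-normalized dialogue costs. The weights (`alpha`, `weights`), the cost names (`elapsed_time`, `reprompts`), and all data values are hypothetical placeholders, not figures from the paper; in practice the weights would be estimated, for example by regressing user satisfaction ratings on the normalized measures.

```python
from statistics import mean, pstdev


def z_normalize(values):
    """Z-score normalize a list of numbers (mean 0, standard deviation 1)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant measure carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def performance(success, costs, alpha, weights):
    """Per-dialogue score: alpha * N(success) - sum_i w_i * N(cost_i).

    success : one task-success value per dialogue
    costs   : dict mapping a cost name to one value per dialogue
    alpha   : assumed weight on normalized task success
    weights : dict mapping each cost name to its assumed weight
    """
    norm_success = z_normalize(success)
    norm_costs = {name: z_normalize(vals) for name, vals in costs.items()}
    scores = []
    for d in range(len(success)):
        penalty = sum(weights[name] * norm_costs[name][d] for name in costs)
        scores.append(alpha * norm_success[d] - penalty)
    return scores


# Hypothetical measurements for three dialogues (illustrative only).
success = [0.9, 0.5, 0.7]
costs = {"elapsed_time": [120.0, 300.0, 210.0], "reprompts": [1, 4, 1]}
scores = performance(
    success, costs, alpha=0.5, weights={"elapsed_time": 0.3, "reprompts": 0.2}
)
```

Because every term is normalized before weighting, measures recorded on different scales (seconds, counts, success rates) become comparable, which is one way to read the abstract's claim about specifying the relative contribution of various factors.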
