Metrics and Evaluation of Spoken Dialogue Systems

The ultimate goal of an evaluation framework is to determine a dialogue system's performance, which can be defined as "the ability of a system to provide the function it has been designed for" [32]. Also important, particularly for industrial systems, is dialogue quality or usability. To measure usability, one can use subjective measures such as User Satisfaction or likelihood of future use. These subjective metrics are difficult to measure and depend on the context and on the individual user, whose goals and values may differ from those of other users. This chapter surveys evaluation frameworks and discusses their advantages and disadvantages. We examine metrics for evaluating system performance and dialogue quality. We also discuss evaluation techniques that can be used to automatically detect problems in the dialogue, thus filtering out good dialogues and leaving poor dialogues for further evaluation and investigation [62].
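To make the idea of a combined performance measure concrete, the sketch below follows the spirit of a PARADISE-style evaluation [42]: per-dialogue metrics are normalized and combined into a single score as a weighted sum of task success and dialogue costs. The dialogue logs, the chosen metrics, and the weights are all hypothetical; in PARADISE the weights are fit by regressing user satisfaction ratings on the normalized metrics, whereas here they are fixed by hand purely for illustration.

```python
from statistics import mean, stdev

def z_scores(values):
    # Normalize raw metric values to zero mean and unit variance,
    # so metrics with different scales can be combined.
    mu, sigma = mean(values), stdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical per-dialogue logs: task success (0/1), number of
# turns, and ASR word error rate.
dialogues = [
    {"success": 1, "turns": 12, "asr_errors": 0.10},
    {"success": 1, "turns": 20, "asr_errors": 0.25},
    {"success": 0, "turns": 35, "asr_errors": 0.40},
    {"success": 1, "turns": 8,  "asr_errors": 0.05},
]

# Illustrative weights: success contributes positively,
# dialogue costs (turns, recognition errors) negatively.
W_SUCCESS, W_TURNS, W_ASR = 0.5, -0.3, -0.2

turns_z = z_scores([d["turns"] for d in dialogues])
asr_z = z_scores([d["asr_errors"] for d in dialogues])

for d, tz, az in zip(dialogues, turns_z, asr_z):
    d["performance"] = W_SUCCESS * d["success"] + W_TURNS * tz + W_ASR * az
```

Under these weights, the short successful dialogue scores highest and the long failed dialogue scores lowest, which is also how such a score can serve the filtering role mentioned above: dialogues below a threshold are flagged for manual inspection.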

[1] Lynette Hirschman et al. The cost of errors in a spoken language system. EUROSPEECH, 1993.

[2] Elizabeth Shriberg et al. Human-Machine Problem Solving Using Spoken Language Systems (SLS): Factors Affecting Performance and User Satisfaction. HLT, 1992.

[3] Oliver Lemon et al. A Two-Tier User Simulation Model for Reinforcement Learning of Adaptive Referring Expression Generation Policies. SIGDIAL Conference, 2009.

[4] Kallirroi Georgila et al. Quantitative Evaluation of User Simulation Techniques for Spoken Dialogue Systems. SIGDIAL, 2005.

[5] Sebastian Möller et al. MeMo: towards automatic usability evaluation of spoken dialogue services by user error simulations. INTERSPEECH, 2006.

[6] Maxine Eskénazi et al. Spoken Dialog Challenge 2010: Comparison of Live and Control Test Results. SIGDIAL Conference, 2011.

[7] Sophie Rosset et al. Predictive Performance of Dialog Systems. LREC, 2000.

[8] Jennifer Balogh et al. Voice User Interface Design. 2004.

[9] C. Kamm. User Interfaces for voice applications. 1994.

[10] Hélène Bonneau-Maynard et al. Evaluation of dialog strategies for a tourist information retrieval system. ICSLP, 1998.

[11] R. L. Keeney et al. Decisions with Multiple Objectives: Preferences and Value Trade-Offs. IEEE Transactions on Systems, Man, and Cybernetics, 1977.

[12] Oliver Lemon et al. Cluster-based user simulations for learning dialogue strategies. INTERSPEECH, 2006.

[13] Milica Gasic et al. The Hidden Information State model: A practical framework for POMDP-based spoken dialogue management. Computer Speech & Language, 2010.

[14] Giuseppe Riccardi et al. How may I help you? Speech Communication, 1997.

[15] Oliver Lemon et al. Learning Effective Multimodal Dialogue Strategies from Wizard-of-Oz Data: Bootstrapping and Evaluation. ACL, 2008.

[16] Kallirroi Georgila et al. Hybrid reinforcement/supervised learning for dialogue policies from COMMUNICATOR data. 2005.

[17] Oliver Lemon et al. Reinforcement Learning for Adaptive Dialogue Systems: A Data-driven Methodology for Dialogue Management and Natural Language Generation. Theory and Applications of Natural Language Processing, 2011.

[18] Tim Paek. Empirical Methods for Evaluating Dialog Systems. SIGDIAL Workshop, 2001.

[19] Sebastian Möller et al. A Framework for Model-based Evaluation of Spoken Dialog Systems. SIGDIAL Workshop, 2008.

[20] Roberto Pieraccini et al. A stochastic model of human-machine interaction for learning dialog strategies. IEEE Transactions on Speech and Audio Processing, 2000.

[21] Ramón López-Cózar et al. Testing the performance of spoken dialogue systems by means of an artificially simulated user. Artificial Intelligence Review, 2006.

[22] Tim Paek et al. Toward Evaluation that Leads to Best Practices: Reconciling Dialog Evaluation in Research and Industry. Proceedings of the Workshop on Bridging the Gap: Academic and Industrial Research in Dialog Technologies, NAACL-HLT, 2007.

[23] Morena Danieli et al. Metrics for Evaluating Dialogue Strategies in a Spoken Language System. ArXiv, 1996.

[24] Marilyn A. Walker et al. An Application of Reinforcement Learning to Dialogue Strategy Selection in a Spoken Dialogue System for Email. Journal of Artificial Intelligence Research, 2000.

[25] Robert Graham et al. Towards a tool for the Subjective Assessment of Speech System Interfaces (SASSI). Natural Language Engineering, 2000.

[26] Norbert Reithinger et al. SpeechEval: Evaluating Spoken Dialog Systems by User Simulation. IJCAI, 2009.

[27] Oliver Lemon et al. Learning to Adapt to Unknown Users: Referring Expression Generation in Spoken Dialogue Systems. ACL, 2010.

[28] Oliver Lemon et al. Learning Adaptive Referring Expression Generation Policies for Spoken Dialogue Systems. Empirical Methods in Natural Language Generation, 2010.

[29] Jean-Luc Gauvain et al. Considerations in the design and evaluation of spoken language dialog systems. INTERSPEECH, 2000.

[30] David P. Morgan et al. How to build a speech recognition application: a style guide for telephony dialogues. 2001.

[31] Sebastian Möller et al. Modeling User Satisfaction with Hidden Markov Models. SIGDIAL Conference, 2009.

[32] Sebastian Möller. Quality of Telephone-Based Spoken Dialogue Systems. 2004.

[33] Gregory A. Sanders et al. DARPA Communicator Evaluation: Progress from 2000 to 2001. 2002.

[34] H. Cuayahuitl et al. Human-computer dialogue simulation using hidden Markov models. IEEE Workshop on Automatic Speech Recognition and Understanding, 2005.

[35] Gregory A. Sanders et al. DARPA Communicator dialog travel planning systems: the June 2000 data collection. INTERSPEECH, 2001.

[36] Gregory A. Sanders et al. DARPA Communicator: cross-system results for the 2001 evaluation. INTERSPEECH, 2002.

[37] Marilyn A. Walker et al. Quantitative and Qualitative Evaluation of DARPA Communicator Spoken Dialogue Systems. ACL, 2001.

[38] H. Grice. Logic and conversation. 1975.

[39] Helen F. Hastie et al. What's the Problem: Automatically Identifying Problematic Dialogues in DARPA Communicator Dialogue Systems. ACL, 2002.

[40] Helen F. Hastie et al. A survey on metrics for the evaluation of user simulations. The Knowledge Engineering Review, 2012.

[41] Markku Turunen et al. Subjective evaluation of spoken dialogue systems using SERVQUAL method. INTERSPEECH, 2004.

[42] Marilyn A. Walker et al. Towards developing general models of usability with PARADISE. Natural Language Engineering, 2000.

[43] Masahiro Araki et al. Automatic Evaluation Environment for Spoken Dialogue Systems. ECAI Workshop on Dialogue Processing in Spoken Language Systems, 1996.

[44] Kallirroi Georgila et al. User simulation for spoken dialogue systems: learning and evaluation. INTERSPEECH, 2006.

[45] David Suendermann-Oeft et al. A Handsome Set of Metrics to Measure Utterance Classification Performance in Spoken Dialog Systems. SIGDIAL Conference, 2009.

[46] David Suendermann-Oeft et al. From rule-based to statistical grammars: Continuous improvement of large-scale spoken dialog systems. IEEE International Conference on Acoustics, Speech and Signal Processing, 2009.

[47] Roberto Pieraccini et al. User modeling for spoken dialogue system evaluation. IEEE Workshop on Automatic Speech Recognition and Understanding, 1997.

[48] Wolfgang Minker et al. Modeling and Predicting Quality in Spoken Human-Computer Interaction. SIGDIAL Conference, 2011.

[49] Jeremy H. Wright et al. Automatically Training a Problematic Dialogue Predictor for a Spoken Dialogue System. Journal of Artificial Intelligence Research, 2011.

[50] Marilyn A. Walker. Can We Talk? Methods for Evaluation and Training of Spoken Dialogue Systems. Language Resources and Evaluation, 2005.

[51] Lin-Shan Lee et al. Computer-aided analysis and design for spoken dialogue systems based on quantitative simulations. IEEE Transactions on Speech and Audio Processing, 2001.

[52] Marilyn A. Walker et al. The Utility of Elapsed Time as a Usability Metric for Spoken Dialogue Systems. 2007.

[53] Oliver Lemon et al. Automatic Learning and Evaluation of User-Centered Objective Functions for Dialogue System Optimisation. LREC, 2008.

[54] Maxine Eskénazi et al. Spoken Dialog Challenge 2010. IEEE Spoken Language Technology Workshop, 2010.

[55] Hua Ai et al. Assessing Dialog System User Simulation Evaluation Measures Using Human Judges. ACL, 2008.

[56] Sebastian Möller et al. Analysis of a new simulation approach to dialog system evaluation. Speech Communication, 2009.