Evaluating spoken dialogue systems according to de-facto standards: A case study

In this paper, we investigate the validity and reliability of de-facto evaluation standards defined for measuring or predicting the quality of interactions with spoken dialogue systems. Two experiments were carried out with a dialogue system for controlling domestic devices. During these experiments, subjective quality judgments were collected with two questionnaire methods (ITU-T Rec. P.851 and SASSI), and parameters describing the interaction were logged and annotated. Both types of metrics served as input for deriving prediction models according to the PARADISE approach. Although the limited database allows only tentative conclusions to be drawn, the results suggest that both questionnaire methods provide valid measurements of a large number of different quality aspects; most of the perceptual dimensions underlying the subjective judgments can also be measured with high reliability. The extracted parameters mainly describe quality aspects that are directly linked to the system, environment, and task characteristics. Used as input to prediction models, the parameters provide helpful information for system design and optimization, but not general predictions of system usability and acceptability. © 2005 Elsevier Ltd. All rights reserved.
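The PARADISE approach referenced above models user satisfaction as a multivariate linear regression over z-normalized interaction parameters (e.g., task success, dialogue duration, recognition error rate). A minimal sketch of this idea, using entirely hypothetical dialogue data and parameter choices, not the actual data or model of the study:

```python
import numpy as np

# Hypothetical logged parameters for 8 dialogues:
# columns = [task success (kappa), dialogue duration (s), ASR error rate]
X = np.array([
    [0.90, 120, 0.05], [0.80, 150, 0.10], [0.60, 200, 0.20],
    [0.95, 100, 0.04], [0.50, 240, 0.30], [0.70, 180, 0.15],
    [0.85, 130, 0.08], [0.40, 260, 0.35],
])
# Hypothetical mean user-satisfaction scores from questionnaires (1-5 scale)
us = np.array([4.5, 4.0, 3.0, 4.8, 2.2, 3.4, 4.2, 1.9])

# z-normalize each parameter so the regression weights are comparable,
# as PARADISE prescribes
mu, sigma = X.mean(axis=0), X.std(axis=0)
Z = (X - mu) / sigma

# Ordinary least squares: US ~ w0 + Z @ w
A = np.hstack([np.ones((Z.shape[0], 1)), Z])
w, *_ = np.linalg.lstsq(A, us, rcond=None)

def predict_satisfaction(params):
    """Predict user satisfaction for a new dialogue's logged parameters."""
    z = (np.asarray(params, dtype=float) - mu) / sigma
    return float(w[0] + z @ w[1:])
```

The magnitudes of the fitted weights then indicate which interaction parameters contribute most to perceived quality, which is the diagnostic use of such models that the abstract alludes to.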

[1]  Marilyn A. Walker,et al.  Evaluating spoken dialogue agents with PARADISE: Two case studies , 1998, Comput. Speech Lang..

[2]  Victor Zue,et al.  Experiments in Evaluating Interactive Spoken Language Systems , 1992, HLT.

[3]  Marilyn A. Walker,et al.  PARADISE: A Framework for Evaluating Spoken Dialogue Agents , 1997, ACL.

[4]  Morena Danieli,et al.  Metrics for Evaluating Dialogue Strategies in a Spoken Language System , 1996, ArXiv.

[5]  Roger K. Moore,et al.  Handbook of Multimodal and Spoken Dialogue Systems: Resources, Terminology and Product Evaluation , 2000 .

[6]  Martin Rajman,et al.  Assessing the usability of a dialogue management system designed in the framework of a rapid dialogue prototyping methodology , 2003 .

[7]  A. Parasuraman,et al.  SERVQUAL: A multiple-item scale for measuring consumer perceptions of service quality. , 1988 .

[8]  Victor Zue,et al.  Data collection and performance evaluation of spoken dialogue systems: the MIT experience , 2000, INTERSPEECH.

[9]  Robert Graham,et al.  Subjective assessment of speech-system interface usability , 2001, INTERSPEECH.

[10]  Andrew C. Simpson,et al.  Black box and glass box evaluation of the SUNDIAL system , 1993, EUROSPEECH.

[11]  Nigel Gilbert,et al.  Simulating speech systems , 1991 .

[12]  Markku Turunen,et al.  Subjective evaluation of spoken dialogue systems using SERVQUAL method , 2004, INTERSPEECH.

[13]  Niels Ole Bernsen,et al.  Usability issues in spoken dialogue systems , 2000, Natural Language Engineering.

[14]  Kate S. Hone,et al.  Towards a tool for the Subjective Assessment of Speech System Interfaces (SASSI) , 2000, Natural Language Engineering.

[15]  Marilyn A. Walker,et al.  Towards developing general models of usability with PARADISE , 2000, Natural Language Engineering.

[16]  Niels Ole Bernsen,et al.  Principles for the design of cooperative spoken human-machine dialogue , 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[17]  Dafydd Gibbon,et al.  Assessment of interactive systems. , 1998 .

[18]  Niels Ole Bernsen,et al.  Evaluation and usability of multimodal spoken language dialogue systems , 2004, Speech Commun..

[19]  Dafydd Gibbon,et al.  Consumer off-the-shelf (COTS) speech technology product and service evaluation , 2000 .

[20]  Sebastian Möller A new Taxonomy for the Quality of Telephone Services Based on Spoken Dialogue Systems , 2002, SIGDIAL Workshop.

[21]  Sebastian Möller  Quality of Telephone-Based Spoken Dialogue Systems , 2004 .

[22]  Arne Jönsson,et al.  Wizard of Oz studies -- why and how , 1993, Knowl. Based Syst..

[23]  Bernhard Suhm,et al.  Towards best practices for speech user interface design , 2003, INTERSPEECH.

[24]  Sebastian Möller,et al.  An analysis of quality prediction models for telephone-based spoken dialogue services , 2004 .

[25]  Lynette Hirschman,et al.  Overview of evaluation in speech and natural language processing , 1997 .

[26]  Elizabeth Shriberg,et al.  Subject-Based Evaluation Measures for Interactive Spoken Language Systems , 1992, HLT.

[27]  Arne Jönsson,et al.  Wizard of Oz studies: why and how , 1993, IUI '93.

[28]  Sebastian Möller,et al.  Quality of Telephone-Based Spoken Dialogue Systems , 2005 .

[29]  Gregory A. Sanders,et al.  DARPA Communicator Evaluation: Progress from 2000 to 2001 , 2002 .

[30]  Niels Ole Bernsen,et al.  Designing interactive speech systems - from first ideas to user testing , 1998 .