Empirical Methods for Evaluating Dialog Systems

We examine what purpose a dialog metric serves and then propose empirical evaluation methods that meet that purpose. The methods include a protocol for conducting a Wizard-of-Oz experiment and a basic set of descriptive statistics for substantiating performance claims, using the data collected from the experiment as an ideal benchmark or "gold standard" for comparative judgments. The methods also provide a practical means of optimizing the system through component analysis and cost valuation.
