A method for evaluating and comparing user simulations: The Cramér-von Mises divergence

Although user simulations are increasingly employed in the development and assessment of spoken dialog systems, there is no accepted method for evaluating user simulations. In this paper, we propose a novel quality measure for user simulations. We view a user simulation as a predictor of the performance of a dialog system, where per-dialog performance is measured with a domain-specific scoring function. The quality of the user simulation is measured as the divergence between the distribution of scores in real dialogs and simulated dialogs, and we argue that the Cramer-von Mises divergence is well-suited to this task. The technique is demonstrated on a corpus of real calls, and we present a table of critical values for practitioners to interpret the statistical significance of comparisons between user simulations.

[1]  F. James Statistical Methods in Experimental Physics , 1973 .

[2]  Baining Guo,et al.  Spoken dialogue management as planning and acting under uncertainty , 2001, INTERSPEECH.

[3]  H. Cramér On the composition of elementary errors: Second paper: Statistical applications , 1928 .

[4]  Roberto Pieraccini,et al.  A stochastic model of human-machine interaction for learning dialog strategies , 2000, IEEE Trans. Speech Audio Process..

[5]  Kallirroi Georgila,et al.  Quantitative Evaluation of User Simulation Techniques for Spoken Dialogue Systems , 2005, SIGDIAL.

[6]  T. W. Anderson On the Distribution of the Two-Sample Cramer-von Mises Criterion , 1962 .

[7]  H. Cuayahuitl,et al.  Human-computer dialogue simulation using hidden Markov models , 2005, IEEE Workshop on Automatic Speech Recognition and Understanding, 2005..

[8]  Steve Young,et al.  Automatic learning of dialogue strategy using dialogue simulation and reinforcement learning , 2002 .

[9]  Kallirroi Georgila,et al.  Hybrid reinforcement/supervised learning for dialogue policies from COMMUNICATOR data , 2005 .

[10]  Roberto Pieraccini,et al.  VALUE-BASED OPTIMAL DECISION FOR DIALOG SYSTEMS , 2006, 2006 IEEE Spoken Language Technology Workshop.

[11]  Olivier Pietquin,et al.  A Framework for Unsupervised Learning of Dialogue Strategies , 2004 .

[12]  Joelle Pineau,et al.  Spoken Dialogue Management Using Probabilistic Reasoning , 2000, ACL.

[13]  Hui Ye,et al.  The Hidden Information State Approach to Dialog Management , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[14]  Jason D. Williams,et al.  Partially Observable Markov Decision Processes for Spoken Dialogue Management , 2006 .

[15]  Marilyn A. Walker,et al.  Towards developing general models of usability with PARADISE , 2000, Natural Language Engineering.

[16]  M. A. Stephens,et al.  Introduction to Kolmogorov (1933) On the Empirical Determination of a Distribution , 1992 .

[17]  S. Singh,et al.  Optimizing Dialogue Management with Reinforcement Learning: Experiments with the NJFun System , 2011, J. Artif. Intell. Res..

[18]  Richard Von Mises,et al.  Wahrscheinlichkeitsrechnung und ihre Anwendung in der Statistik und theoretischen Physik , 1931 .

[19]  Anton Nijholt,et al.  A tractable DDN-POMDP Approach to Affective Dialogue Modeling for General Probabilistic Frame-based Dialogue Systems , 2007 .