Real User Evaluation of Spoken Dialogue Systems Using Amazon Mechanical Turk

This paper describes a framework for evaluation of spoken dialogue systems. Typically, evaluation of dialogue systems is performed in a controlled test environment with carefully selected and instructed users. However, this approach is very demanding. An alternative is to recruit a large group of users who evaluate the dialogue systems in a remote setting under virtually no supervision. Crowdsourcing technology, for example Amazon Mechanical Turk (AMT), provides an efficient way of recruiting subjects. This paper describes an evaluation framework for spoken dialogue systems using AMT users and compares the obtained results with a recent trial in which the systems were tested by locally recruited users. The results suggest that the use of crowdsourcing technology is feasible and it can provide reliable results. Copyright © 2011 ISCA.

[1]  Milica Gasic,et al.  Phrase-Based Statistical Language Generation Using Graphical Models and Active Learning , 2010, ACL.

[2]  Steve J. Young,et al.  Bayesian update of dialogue state: A POMDP framework for spoken dialogue systems , 2010, Comput. Speech Lang..

[3]  Fabrice Lefèvre,et al.  Back-off action selection in summary space-based POMDP dialogue systems , 2009, 2009 IEEE Workshop on Automatic Speech Recognition & Understanding.

[4]  Milica Gasic,et al.  The Hidden Information State model: A practical framework for POMDP-based spoken dialogue management , 2010, Comput. Speech Lang..

[5]  Maxine Eskénazi,et al.  Toward better crowdsourced transcription: Transcription of a year of the Let's Go Bus Information System data , 2010, 2010 IEEE Spoken Language Technology Workshop.

[6]  Matthieu Geist,et al.  Managing uncertainty within the ktd framework , 2011 .

[7]  Yi Zhu,et al.  Collection of user judgments on spoken dialog system with crowdsourcing , 2010, 2010 IEEE Spoken Language Technology Workshop.

[8]  David A. Forsyth,et al.  Utility data annotation with Amazon Mechanical Turk , 2008, 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[9]  Steve J. Young,et al.  Natural actor and belief critic: Reinforcement algorithm for learning parameters of dialogue systems modelled as POMDPs , 2011, TSLP.

[10]  Alexander I. Rudnicky,et al.  Using the Amazon Mechanical Turk for transcription of spoken language , 2010, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing.

[11]  Milica Gasic,et al.  Parameter estimation for agenda-based user simulation , 2010, SIGDIAL Conference.