On Evaluating and Comparing Conversational Agents

Conversational agents are exploding in popularity. However, much work remains on non-goal-oriented conversations, despite significant growth in research interest over recent years. To advance the state of the art in conversational AI, Amazon launched the Alexa Prize, a 2.5-million-dollar university competition in which sixteen selected university teams built conversational agents to deliver the best social conversational experience. The Alexa Prize provided the academic community with the unique opportunity to perform research on a live system used by millions of users. The subjectivity associated with evaluating conversations is a key element underlying the challenge of building non-goal-oriented dialogue systems. In this paper, we propose a comprehensive evaluation strategy with multiple metrics designed to reduce subjectivity by selecting metrics that correlate well with human judgment. The proposed metrics provide a granular analysis of conversational agents that is not captured in human ratings, and we show that they can serve as a reasonable proxy for human judgment. We also provide a mechanism to unify the metrics for selecting the top-performing agents, which has been applied throughout the Alexa Prize competition. To our knowledge, this is the largest setting to date for evaluating agents, with millions of conversations and hundreds of thousands of user ratings. We believe this work is a step towards an automatic evaluation process for conversational AI.
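To make the core idea concrete, the sketch below illustrates one plausible way to select automatic metrics by their correlation with human ratings and to unify them into a single score for ranking agents. This is a minimal illustration, not the paper's actual method: the metric names (engagement, coherence, topical_depth), the rating scale, the correlation threshold, and the correlation-weighted z-score combination are all assumptions made for the example.

```python
# Minimal sketch (assumed, not the authors' implementation) of metric
# selection by correlation with human judgment and metric unification.
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

# Hypothetical per-conversation data: automatic metrics plus a 1-5 human rating.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "engagement":    rng.random(500),
    "coherence":     rng.random(500),
    "topical_depth": rng.random(500),
    "human_rating":  rng.integers(1, 6, size=500),
})

candidate_metrics = ["engagement", "coherence", "topical_depth"]

# Step 1: keep only metrics whose rank correlation with human ratings
# exceeds a chosen threshold (the 0.3 cutoff is an assumption).
selected = []
for m in candidate_metrics:
    rho, _ = spearmanr(df[m], df["human_rating"])
    if abs(rho) >= 0.3:
        selected.append((m, rho))

# Step 2: unify the selected metrics into one score by z-normalising each
# metric and weighting it by its correlation with human ratings.
weights = {m: abs(rho) for m, rho in selected}
total = sum(weights.values()) or 1.0
df["unified_score"] = sum(
    (w / total) * (df[m] - df[m].mean()) / df[m].std()
    for m, w in weights.items()
)

# Agents can then be ranked by their mean unified score across conversations.
print(df["unified_score"].describe())
```

In practice, the weighting scheme and threshold would be tuned against held-out human ratings; the point of the sketch is only that metric selection and unification can be framed as correlation analysis against human judgment.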
