ACUTE-EVAL: Improved Dialogue Evaluation with Optimized Questions and Multi-turn Comparisons

While dialogue remains an important end-goal of natural language research, the difficulty of evaluation is an oft-cited reason why it remains hard to make real progress towards its solution. The difficulties are actually two-fold: not only do automatic metrics correlate poorly with human judgments, but human judgments themselves are difficult to measure reliably. The two most commonly used human judgment tests, single-turn pairwise evaluation and multi-turn Likert scores, both have serious flaws, as we discuss in this work. We instead propose a novel procedure in which a human judge compares two full dialogues, attending to only one speaker within each, and makes a pairwise judgment. The questions posed to annotators are themselves optimized to maximize the robustness of judgments across different annotators, yielding more reliable tests. We also show that these tests work in self-chat setups, where a model converses with itself, making evaluation faster and cheaper. We hope these tests become the de facto standard, and will release open-source code to that end.
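
As a rough illustration of the pairwise protocol described above, the sketch below shows how per-pair annotator judgments might be aggregated into a win rate for one model over another, with a two-sided binomial test against the 50/50 null hypothesis. This is a minimal sketch, not the paper's released code: the names `PairwiseJudgment` and `win_rate_and_pvalue` are hypothetical, and the choice of a binomial significance test is an assumption about how such comparisons are typically scored.

```python
# Minimal sketch of aggregating pairwise dialogue judgments.
# PairwiseJudgment and win_rate_and_pvalue are illustrative names,
# not taken from the paper's released code.
from dataclasses import dataclass
from scipy.stats import binomtest


@dataclass
class PairwiseJudgment:
    """One annotator's choice between two full dialogues."""
    model_a: str   # model behind the speaker in the first dialogue
    model_b: str   # model behind the speaker in the second dialogue
    winner: str    # which of the two models the annotator preferred


def win_rate_and_pvalue(judgments, model_a, model_b):
    """Fraction of head-to-head trials won by model_a, plus a
    two-sided binomial test against the 50/50 null.
    Assumes at least one matching trial exists."""
    trials = [j for j in judgments
              if {j.model_a, j.model_b} == {model_a, model_b}]
    wins = sum(j.winner == model_a for j in trials)
    result = binomtest(wins, n=len(trials), p=0.5)
    return wins / len(trials), result.pvalue


# Toy example: across 60 matchups, annotators prefer "human"
# over "bot" 41 times.
judgments = (
    [PairwiseJudgment("human", "bot", "human")] * 41
    + [PairwiseJudgment("human", "bot", "bot")] * 19
)
rate, p = win_rate_and_pvalue(judgments, "human", "bot")
print(f"win rate = {rate:.2f}, p = {p:.4f}")
```

The same aggregation applies whether the compared dialogues come from live human-model chats or from the self-chat setup, which is what makes the self-chat variant a cheap drop-in: only the source of the dialogue pairs changes, not the judgment or scoring step.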
