Evaluating Dialogue Generation Systems via Response Selection

Existing automatic evaluation metrics for open-domain dialogue response generation systems correlate poorly with human evaluation. We focus on evaluating response generation systems via response selection. To evaluate systems properly via response selection, we propose a method for constructing response selection test sets with well-chosen false candidates. Specifically, we construct test sets by filtering out two types of false candidates: (i) those unrelated to the ground-truth response and (ii) those acceptable as appropriate responses. Through experiments, we demonstrate that evaluating systems via response selection on test sets built with our method correlates more strongly with human evaluation than widely used automatic evaluation metrics such as BLEU.
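To make the candidate-filtering idea concrete, the sketch below illustrates it in Python. It is not the authors' actual pipeline: TF-IDF cosine similarity stands in for the relatedness check, the `is_acceptable` stub stands in for what the paper delegates to human annotators, and the threshold, candidate count, and function names are illustrative assumptions.

```python
# Minimal, illustrative sketch of selecting "well-chosen" false candidates:
# keep candidates related to the ground-truth response, drop candidates that
# would themselves be acceptable replies. Similarity here is plain TF-IDF
# cosine similarity; the acceptability check is a placeholder for human
# annotation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def is_acceptable(candidate: str) -> bool:
    # Stub: in practice this judgment would be made by human annotators,
    # not by an automatic rule.
    return False


def select_false_candidates(ground_truth: str, candidate_pool: list[str],
                            min_similarity: float = 0.2,
                            num_candidates: int = 3) -> list[str]:
    """Pick false candidates that are related to the ground-truth response
    but are not acceptable responses themselves."""
    vectorizer = TfidfVectorizer()
    vectors = vectorizer.fit_transform([ground_truth] + candidate_pool)
    sims = cosine_similarity(vectors[0:1], vectors[1:]).ravel()

    filtered = [
        (cand, sim) for cand, sim in zip(candidate_pool, sims)
        if sim >= min_similarity        # (i) drop candidates unrelated to the ground truth
        and not is_acceptable(cand)     # (ii) drop candidates that are acceptable responses
    ]
    # Keep the most related surviving candidates as hard negatives.
    filtered.sort(key=lambda pair: pair[1], reverse=True)
    return [cand for cand, _ in filtered[:num_candidates]]
```

A system is then scored by how often it ranks the ground-truth response above these false candidates, which the abstract argues tracks human judgments more closely than metrics such as BLEU.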
