Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach

Reliable automatic evaluation of dialog systems in an interactive setting is long overdue. The ideal environment for evaluating a dialog system, akin to the Turing test, requires live human interaction, which is usually too costly for large-scale experiments. Researchers have attempted automatic evaluation using metrics borrowed from language generation tasks (e.g., perplexity, BLEU) or model-based reinforcement learning methods (e.g., self-play evaluation), but these methods correlate only weakly with actual human evaluation in practice. To bridge this gap, we propose ENIGMA, a framework for estimating human evaluation scores that builds on recent advances in off-policy evaluation in reinforcement learning. ENIGMA requires only a modest amount of pre-collected experience data and therefore involves no human interaction with the target policy during evaluation, making automatic evaluation feasible. More importantly, ENIGMA is model-free and agnostic to the behavior policies used to collect the experience data (see details in Section 2), which significantly alleviates the technical difficulty of modeling complex dialog environments and human behaviors. Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
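To make the off-policy evaluation setting concrete, the following is a minimal sketch of classical per-decision importance sampling, the textbook OPE baseline that requires knowing the behavior policy's action probabilities. This is not ENIGMA itself (which, per the abstract, is behavior-agnostic and avoids exactly this requirement); the function name and data layout here are purely illustrative.

```python
def importance_sampling_value(trajectories, gamma=0.99):
    """Estimate the target policy's expected discounted return from logged data.

    Each trajectory is a list of steps (p_target, p_behavior, reward), where
    p_target and p_behavior are the probabilities each policy assigns to the
    logged action. Per-decision importance sampling reweights each reward by
    the cumulative likelihood ratio up to that step; note this estimator
    requires p_behavior, which a behavior-agnostic method would not.
    """
    estimates = []
    for traj in trajectories:
        ratio, ret = 1.0, 0.0
        for t, (p_tgt, p_beh, r) in enumerate(traj):
            ratio *= p_tgt / p_beh        # cumulative importance weight
            ret += (gamma ** t) * ratio * r
        estimates.append(ret)
    return sum(estimates) / len(estimates)
```

When the target and behavior policies coincide, the weights are all 1 and the estimate reduces to the average discounted return of the logged trajectories; as the policies diverge, the cumulative ratio's variance grows exponentially in the horizon, which is the "curse of horizon" that marginalized (stationary-distribution) OPE methods are designed to break.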
