Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach
Bo Dai, Tuo Zhao, Haoming Jiang, Mengjiao Yang, Wei Wei
[1] Yifei Ma, et al. Optimal Off-Policy Evaluation for Reinforcement Learning with Marginalized Importance Sampling, 2019, NeurIPS.
[2] Alon Lavie, et al. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments, 2005, IEEvaluation@ACL.
[3] Kevin Gimpel, et al. Gaussian Error Linear Units (GELUs), 2016.
[4] Yu-Xiang Wang, et al. Asymptotically Efficient Off-Policy Evaluation for Tabular Reinforcement Learning, 2020, AISTATS.
[5] Maxine Eskenazi, et al. Unsupervised Evaluation of Interactive Dialog with DialoGPT, 2020, SIGDIAL.
[6] Joelle Pineau, et al. Bootstrapping Dialog Systems with Word Embeddings, 2014.
[7] Jean Carletta, et al. Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005, ACL.
[8] Sebastian Möller, et al. Memo: towards automatic usability evaluation of spoken dialogue services by user error simulations, 2006, INTERSPEECH.
[9] Dawei Yin, et al. Modeling Topical Relevance for Multi-Turn Dialogue Generation, 2020, IJCAI.
[10] Masatoshi Uehara, et al. Statistically Efficient Off-Policy Policy Gradients, 2020, ICML.
[11] Maja J. Mataric, et al. Reward Functions for Accelerated Learning, 1994, ICML.
[12] Chin-Yew Lin, et al. ROUGE: A Package for Automatic Evaluation of Summaries, 2004, ACL.
[13] Amanda Stent, et al. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers), 2018, NAACL.
[14] H. Zha, et al. Reliable Off-policy Evaluation for Reinforcement Learning, 2020, arXiv.
[15] Qiang Liu, et al. Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation, 2018, NeurIPS.
[16] Lin F. Yang, et al. On Landscape of Lagrangian Functions and Stochastic Search for Constrained Nonconvex Optimization, 2018, arXiv.
[17] Joelle Pineau, et al. How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation, 2016, EMNLP.
[18] Hoang Minh Le, et al. Empirical Study of Off-Policy Policy Evaluation for Reinforcement Learning, 2019, NeurIPS Datasets and Benchmarks.
[19] Jason Weston, et al. ParlAI: A Dialog Research Software Platform, 2017, EMNLP.
[20] Richard S. Sutton, et al. Reinforcement Learning: An Introduction, 1998, IEEE Trans. Neural Networks.
[21] Natasha Jaques, et al. Way Off-Policy Batch Deep Reinforcement Learning of Implicit Human Preferences in Dialog, 2019, arXiv.
[22] Michail G. Lagoudakis, et al. Least-Squares Policy Iteration, 2003, J. Mach. Learn. Res.
[23] Anton Leuski, et al. Toward Learning and Evaluation of Dialogue Policies with Text Examples, 2011, SIGDIAL.
[24] Ruosong Wang, et al. What are the Statistical Limits of Offline RL with Linear Function Approximation?, 2020, ICLR.
[25] Bo Dai, et al. GenDICE: Generalized Offline Estimation of Stationary Values, 2020, ICLR.
[26] Kilian Q. Weinberger, et al. BERTScore: Evaluating Text Generation with BERT, 2019, ICLR.
[27] Doina Precup, et al. Eligibility Traces for Off-Policy Policy Evaluation, 2000, ICML.
[28] Mamoru Komachi, et al. Machine Translation Evaluation with BERT Regressor, 2019, arXiv.
[29] Wei Wei, et al. PONE, 2020, ACM Trans. Inf. Syst.
[30] Jianfeng Gao, et al. Deep Reinforcement Learning for Dialogue Generation, 2016, EMNLP.
[31] Ilya Kostrikov, et al. AlgaeDICE: Policy Gradient from Arbitrary Experience, 2019, arXiv.
[32] D. Horvitz, et al. A Generalization of Sampling Without Replacement from a Finite Universe, 1952.
[33] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.
[34] A. M. Turing, et al. Computing Machinery and Intelligence, 1950, The Philosophy of Artificial Intelligence.
[35] Masashi Toyoda, et al. uBLEU: Uncertainty-Aware Automatic Evaluation Method for Open-Domain Dialogue Systems, 2020, ACL.
[36] Natasha Jaques, et al. Approximating Interactive Human Evaluation with Self-Play for Open-Domain Dialog Systems, 2019, NeurIPS.
[37] Vasile Rus, et al. An Optimal Assessment of Natural Language Student Input Using Word-to-Word Similarity Metrics, 2012, ITS.
[38] Kevin Gimpel, et al. Towards Universal Paraphrastic Sentence Embeddings, 2015, ICLR.
[39] Yang Xiang, et al. Problematic Situation Analysis and Automatic Recognition for Chinese Online Conversational System, 2014, CIPS-SIGHAN.
[40] Mengdi Wang, et al. Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation, 2020, ICML.
[41] Omer Levy, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019, arXiv.
[42] Jason Weston, et al. Personalizing Dialogue Agents: I have a dog, do you have pets too?, 2018, ACL.
[43] D. Traum, et al. A Semi-automated Evaluation Metric for Dialogue Model Coherence, 2014, IWSDS.
[44] Jinho D. Choi, et al. Towards Unified Dialogue System Evaluation: A Comprehensive Analysis of Current Evaluation Protocols, 2020, SIGDIAL.
[45] Dongyan Zhao, et al. RUBER: An Unsupervised Method for Automatic Evaluation of Open-Domain Dialog Systems, 2017, AAAI.
[46] Mirella Lapata, et al. Vector-based Models of Semantic Composition, 2008, ACL.
[47] Salim Roukos, et al. Bleu: a Method for Automatic Evaluation of Machine Translation, 2002, ACL.
[48] Joelle Pineau, et al. Towards an Automatic Turing Test: Learning to Evaluate Dialogue Responses, 2017, ACL.
[49] Jason Weston, et al. What makes a good conversation? How controllable attributes affect human judgments, 2019, NAACL.
[50] George R. Doddington, et al. The ATIS Spoken Language Systems Pilot Corpus, 1990, HLT.
[51] Bing Liu, et al. Bootstrapping a Neural Conversational Agent with Dialogue Self-Play, Crowdsourcing and On-Line Reinforcement Learning, 2018, NAACL.
[52] Bo Dai, et al. DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections, 2019, NeurIPS.
[53] Quoc V. Le, et al. AirDialogue: An Environment for Goal-Oriented Dialogue Research, 2018, EMNLP.
[54] Zhou Yu, et al. Strategy and Policy Learning for Non-Task-Oriented Conversational Systems, 2016, SIGDIAL.
[55] Quoc V. Le, et al. Towards a Human-like Open-Domain Chatbot, 2020, arXiv.
[56] Joelle Pineau, et al. The Second Conversational Intelligence Challenge (ConvAI2), 2019, The NeurIPS '18 Competition.
[57] Xiang Gao, et al. Dialogue Response Ranking Training with Large-Scale Human Feedback Data, 2020, EMNLP.
[58] Hai Zhao, et al. Task-specific Objectives of Pre-trained Language Models for Dialogue Adaptation, 2020, arXiv.
[59] Nanyun Peng, et al. Better Automatic Evaluation of Open-Domain Dialogue Systems with Contextualized Embeddings, 2019, Workshop on Methods for Optimizing and Evaluating Neural Language Generation.
[60] Tatsuya Kawahara, et al. Designing Precise and Robust Dialogue Response Evaluators, 2020, ACL.
[61] Arantxa Otegi, et al. Survey on evaluation methods for dialogue systems, 2019, Artificial Intelligence Review.
[62] Masatoshi Uehara, et al. Efficiently Breaking the Curse of Horizon in Off-Policy Evaluation with Double Reinforcement Learning, 2019.
[63] Osmar R. Zaïane, et al. Evaluating Coherence in Dialogue Systems using Entailment, 2019, NAACL.
[64] Thibault Sellam, et al. BLEURT: Learning Robust Metrics for Text Generation, 2020, ACL.
[65] Xiaodan Liang, et al. GRADE: Automatic Graph-Enhanced Coherence Metric for Evaluating Open-Domain Dialogue Systems, 2020, EMNLP.
[66] Le Song, et al. Boosting the Actor with Dual Critic, 2017, ICLR.
[67] Sujay Sanghavi, et al. Nearly Horizon-Free Offline Reinforcement Learning, 2021, arXiv.
[68] Martin L. Puterman, et al. Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1994.
[69] Thomas Wolf, et al. HuggingFace's Transformers: State-of-the-art Natural Language Processing, 2019, arXiv.
[70] Jianfeng Gao, et al. deltaBLEU: A Discriminative Metric for Generation Tasks with Intrinsically Diverse Targets, 2015, ACL.
[71] Bo Dai, et al. Off-Policy Evaluation via the Regularized Lagrangian, 2020, NeurIPS.
[72] Masatoshi Uehara, et al. Minimax Weight and Q-Function Learning for Off-Policy Evaluation, 2019, ICML.
[73] Erik Nijkamp, et al. Towards Holistic and Automatic Evaluation of Open-Domain Dialogue Generation, 2020, ACL.
[74] Nan Jiang, et al. Doubly Robust Off-policy Value Evaluation for Reinforcement Learning, 2015, ICML.
[75] Mitesh M. Khapra, et al. Improving Dialog Evaluation with a Multi-reference Adversarial Dataset and Large Scale Pretraining, 2020, TACL.
[76] Robert L. Mercer, et al. An Estimate of an Upper Bound for the Entropy of English, 1992, CL.
[77] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.
[78] Ryuichiro Higashinaka, et al. Evaluating coherence in open domain conversational systems, 2014, INTERSPEECH.
[79] Antoine Raux, et al. The Dialog State Tracking Challenge, 2013, SIGDIAL.