Overview of Robust and Multilingual Automatic Evaluation Metrics for Open-Domain Dialogue Systems at DSTC 11 Track 4
Alexander I. Rudnicky | João Sedoc | Sarik Ghazarian | L. F. D’Haro | Chengguang Tang | Chen Zhang | Ke Shi | Mario Rodríguez-Cantelar
[1] H. A. Schwartz,et al. Human-Centered Metrics for Dialog System Evaluation , 2023, ArXiv.
[2] Haizhou Li,et al. FineD-Eval: Fine-grained Automatic Dialogue-Level Evaluation , 2022, EMNLP.
[3] Nanyun Peng,et al. EnDex: Evaluation of Dialogue Engagingness at Scale , 2022, EMNLP.
[4] Xipeng Qiu,et al. BERTScore is Unfair: On Social Bias in Language Model-Based Metrics for Text Generation , 2022, EMNLP.
[5] Eric Michael Smith,et al. BlenderBot 3: a deployed conversational agent that continually learns to responsibly engage , 2022, ArXiv.
[6] Dilek Z. Hakkani-Tür,et al. Report from the NSF Future Directions Workshop on Automatic Evaluation of Dialog: Research Directions and Challenges , 2022, ArXiv.
[7] Luis Fernando D'Haro,et al. MDD-Eval: Self-Training on Augmented Data for Multi-Domain Dialogue Evaluation , 2021, AAAI.
[8] Alexander I. Rudnicky,et al. Automatic Evaluation and Moderation of Open-domain Dialogue Systems , 2021, ArXiv.
[9] Hua Wu,et al. PLATO-XL: Exploring the Large-scale Pre-training of Dialogue Generation , 2021, AACL/IJCNLP.
[10] Maxine Eskenazi,et al. A Comprehensive Assessment of Dialog Evaluation Metrics , 2021, EANCS.
[11] Mohit Bansal,et al. I like fish, especially dolphins: Addressing Contradictions in Dialogue Modeling , 2020, ACL.
[12] Alon Lavie,et al. COMET: A Neural Framework for MT Evaluation , 2020, EMNLP.
[13] Vahid Behzadan,et al. Sentimental LIAR: Extended Corpus and Deep Learning Models for Fake Claim Classification , 2020, 2020 IEEE International Conference on Intelligence and Security Informatics (ISI).
[14] Minlie Huang,et al. A Large-Scale Chinese Short-Text Conversation Dataset , 2020, NLPCC.
[15] Maxine Eskenazi,et al. Unsupervised Evaluation of Interactive Dialog with DialoGPT , 2020, SIGDIAL.
[16] Giuseppe Riccardi,et al. Is this Dialogue Coherent? Learning from Dialogue Acts and Entities , 2020, SIGDIAL.
[17] Mark Chen,et al. Language Models are Few-Shot Learners , 2020, NeurIPS.
[18] Maxine Eskenazi,et al. USR: An Unsupervised and Reference Free Evaluation Metric for Dialog Generation , 2020, ACL.
[19] Tatsuya Kawahara,et al. Designing Precise and Robust Dialogue Response Evaluators , 2020, ACL.
[20] Thibault Sellam,et al. BLEURT: Learning Robust Metrics for Text Generation , 2020, ACL.
[21] Minlie Huang,et al. KdConv: A Chinese Multi-domain Dialogue Dataset Towards Multi-turn Knowledge-driven Conversation , 2020, ACL.
[22] Quoc V. Le,et al. Towards a Human-like Open-Domain Chatbot , 2020, ArXiv.
[23] Andreas Holzinger,et al. Human Annotated Dialogues Dataset for Natural Conversational Agents , 2020, Applied Sciences.
[24] Myle Ott,et al. Unsupervised Cross-lingual Representation Learning at Scale , 2019, ACL.
[25] Jianfeng Gao,et al. DIALOGPT: Large-Scale Generative Pre-training for Conversational Response Generation , 2019, ACL.
[26] Dilek Z. Hakkani-Tür,et al. Topical-Chat: Towards Knowledge-Grounded Open-Domain Conversations , 2019, INTERSPEECH.
[27] Maxine Eskenazi,et al. Investigating Evaluation of Open-Domain Dialogue Systems With Human Generated Multiple References , 2019, SIGDIAL.
[28] Chris Callison-Burch,et al. ChatEval: A Tool for Chatbot Evaluation , 2019, NAACL.
[29] Arantxa Otegi,et al. Survey on evaluation methods for dialogue systems , 2019, Artificial Intelligence Review.
[30] Jianfeng Gao,et al. Multi-Domain Task-Completion Dialog Challenge , 2019 .
[31] Jason Weston,et al. What makes a good conversation? How controllable attributes affect human judgments , 2019, NAACL.
[32] Harry Shum,et al. The Design and Implementation of XiaoIce, an Empathetic Social Chatbot , 2018, CL.
[33] Y-Lan Boureau,et al. Towards Empathetic Open-domain Conversation Models: A New Benchmark and Dataset , 2018, ACL.
[34] Rada Mihalcea,et al. MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations , 2018, ACL.
[35] J. Weston,et al. Wizard of Wikipedia: Knowledge-Powered Conversational agents , 2018, ICLR.
[36] Mitesh M. Khapra,et al. Towards Exploiting Background Knowledge for Building Conversation Systems , 2018, EMNLP.
[37] Alan W. Black,et al. A Dataset for Document Grounded Conversations , 2018, EMNLP.
[38] Hai Zhao,et al. Modeling Multi-turn Conversation with Deep Utterance Aggregation , 2018, COLING.
[39] Lun-Wei Ku,et al. EmotionLines: An Emotion Corpus of Multi-Party Conversations , 2018, LREC.
[40] Jason Weston,et al. Personalizing Dialogue Agents: I have a dog, do you have pets too? , 2018, ACL.
[41] Xiaoyu Shen,et al. DailyDialog: A Manually Labelled Multi-turn Dialogue Dataset , 2017, IJCNLP.
[42] Minlie Huang,et al. Emotional Chatting Machine: Emotional Conversation Generation with Internal and External Memory , 2017, AAAI.
[43] Zhoujun Li,et al. Sequential Match Network: A New Architecture for Multi-turn Response Selection in Retrieval-based Chatbots , 2016, ArXiv.
[44] J. Clemente,et al. Intestinal Microbiota Is Influenced by Gender and Body Mass Index , 2016, PloS one.
[45] Yuka Kobayashi,et al. The dialogue breakdown detection challenge: Task description, datasets, and evaluation metrics , 2016, LREC.
[46] Joelle Pineau,et al. How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation , 2016, EMNLP.
[47] Hang Li,et al. Neural Responding Machine for Short-Text Conversation , 2015, ACL.
[48] Rafael E. Banchs. Movie-DiC: a Movie Dialogue Corpus for Research and Development , 2012, ACL.
[49] Cristian Danescu-Niculescu-Mizil,et al. Chameleons in Imagined Conversations: A New Approach to Understanding Coordination of Linguistic Style in Dialogs , 2011, CMCL@ACL.
[50] Chin-Yew Lin,et al. ROUGE: A Package for Automatic Evaluation of Summaries , 2004, ACL 2004.
[51] Salim Roukos,et al. Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.
[52] Khyathi Raghavi Chandu,et al. Needle in a Haystack: An Analysis of Finding Qualified Workers on MTurk for Summarization , 2022, ArXiv.
[53] Haizhou Li,et al. Deep AM-FM: Toolkit for Automatic Dialogue Evaluation , 2020, IWSDS.
[54] Ming-Wei Chang,et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.
[55] Bill Dolan,et al. Grounded Response Generation Task at DSTC7 , 2019 .
[56] Ilya Sutskever,et al. Language Models are Unsupervised Multitask Learners , 2019 .
[57] Alec Radford,et al. Improving Language Understanding by Generative Pre-Training , 2018 .