Overview of Robust and Multilingual Automatic Evaluation Metrics for Open-Domain Dialogue Systems at DSTC 11 Track 4

The advent and rapid development of neural networks have revolutionized research on dialogue systems and, in turn, raised new challenges for their automatic evaluation. Automatic evaluation of open-domain dialogue systems remains an open problem and has attracted considerable research attention. Despite consistent efforts to improve the correlation of automatic metrics with human evaluation, few studies have assessed the robustness of these metrics across multiple domains and evaluation dimensions, and most focus primarily on English. These challenges motivate the development of automatic evaluation metrics that are reliable across domains, dimensions, and languages. This track of the 11th Dialogue System Technology Challenge (DSTC11) is part of the ongoing effort to promote robust and multilingual automatic evaluation metrics. This article describes the datasets and baselines provided to participants and discusses the submissions and results of the two proposed subtasks.
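
Although the abstract does not detail the evaluation protocol, correlation with human judgments is the standard way such metrics are validated. The minimal Python sketch below, which uses hypothetical scores rather than data from the track, illustrates computing Pearson and Spearman correlations between an automatic metric's scores and human ratings with SciPy.

```python
# Minimal sketch (not from the track description): validating an automatic
# dialogue metric by correlating its scores with human annotations.
from scipy.stats import pearsonr, spearmanr

# Hypothetical per-response scores from an automatic metric (e.g., appropriateness).
metric_scores = [0.81, 0.42, 0.67, 0.15, 0.93, 0.58]
# Hypothetical mean human ratings for the same responses (e.g., on a 1-5 scale).
human_ratings = [4.5, 2.0, 3.5, 1.5, 5.0, 3.0]

# Pearson measures linear agreement; Spearman measures rank agreement,
# which is often reported for dialogue evaluation metrics.
pearson_r, pearson_p = pearsonr(metric_scores, human_ratings)
spearman_rho, spearman_p = spearmanr(metric_scores, human_ratings)

print(f"Pearson r    = {pearson_r:.3f} (p = {pearson_p:.3f})")
print(f"Spearman rho = {spearman_rho:.3f} (p = {spearman_p:.3f})")
```

In practice, such correlations are computed per evaluation dimension (e.g., appropriateness, coherence) and per dataset, so that a metric's robustness across domains and languages can be compared rather than a single aggregate score.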
