Challenges in the Evaluation of Conversational Search Systems

The area of conversational search has gained significant traction in the IR research community, motivated by the widespread use of personal assistants. A frequently studied task in this setting is conversation response ranking, that is, retrieving the best response for a given ongoing conversation from a corpus of historic conversations. While this is intuitively an important step towards (retrieval-based) conversational search, the empirical evaluation currently used to assess trained rankers is far removed from this setting: typically, the ranker is presented with a very small number (e.g., 10) of non-relevant responses and a single relevant response. In a real-world scenario, a retrieval-based system has to retrieve responses from a large pool (e.g., several million responses) or determine that no appropriate response can be found. In this paper, we highlight these critical issues in the offline evaluation schemes for tasks related to conversational search. We argue that the evaluation schemes currently in use have critical limitations and simplify the conversational search tasks to a degree that makes it questionable whether we can trust the findings they deliver.

ACM Reference Format: Gustavo Penha and Claudia Hauff. 2020. Challenges in the Evaluation of Conversational Search Systems. In Proceedings of KDD Workshop on Conversational Systems Towards Mainstream Adoption (KDD Converse’20). ACM, New York, NY, USA, 5 pages.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
