Meta-evaluation of Conversational Search Evaluation Metrics

Conversational search systems, such as Google Assistant and Microsoft Cortana, enable users to interact with search systems over multiple rounds of natural language dialogue. Evaluating such systems is challenging: any natural language response could be generated, and users commonly interact over multiple semantically coherent rounds to accomplish a search task. Although prior studies have proposed many evaluation metrics, the extent to which those metrics effectively capture user preference remains to be investigated. In this article, we systematically meta-evaluate a variety of conversational search metrics. We specifically study three perspectives on those metrics: (1) reliability: the ability to detect “actual” performance differences as opposed to those observed by chance; (2) fidelity: the ability to agree with ultimate user preference; and (3) intuitiveness: the ability to capture properties deemed important in the context of conversational search, namely adequacy, informativeness, and fluency. Through experiments on two test collections, we find that the performance of different metrics varies significantly across scenarios, and, consistent with prior studies, existing metrics achieve only weak correlation with ultimate user preference and satisfaction. METEOR is, comparatively, the best existing single-turn metric across all three perspectives. We also demonstrate that adapted session-based evaluation metrics can be used to measure multi-turn conversational search, achieving moderate concordance with user satisfaction. To our knowledge, our work establishes the most comprehensive meta-evaluation of conversational search to date.
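
To make the fidelity perspective concrete, the sketch below shows one common way such a meta-evaluation can be carried out: correlating an automatic metric's per-response scores with human preference or satisfaction judgments via Kendall's tau. The scores and ratings here are hypothetical placeholders, and this is a minimal illustration rather than the paper's exact experimental procedure.

```python
# Minimal sketch of a "fidelity" meta-evaluation: correlate automatic metric
# scores with human judgments for the same responses. Data is hypothetical.
from scipy.stats import kendalltau

# Hypothetical per-response automatic scores (e.g., produced by METEOR)
# and human satisfaction ratings on a 1-5 scale for the same responses.
metric_scores = [0.41, 0.18, 0.55, 0.30, 0.62, 0.12, 0.47, 0.35]
human_ratings = [4, 2, 5, 3, 5, 1, 3, 4]

# Kendall's tau measures rank concordance: a value near 1 means the metric
# orders responses the same way users do; near 0 means weak agreement.
tau, p_value = kendalltau(metric_scores, human_ratings)
print(f"Kendall's tau = {tau:.3f} (p = {p_value:.3f})")
```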
