Meta-evaluation of Conversational Search Evaluation Metrics
Zeyang Liu | Ke Zhou | Max L. Wilson
[1] Kilian Q. Weinberger, et al. BERTScore: Evaluating Text Generation with BERT, 2019, ICLR.
[2] Tetsuya Sakai, et al. The Effect of Topic Sampling on Sensitivity Comparisons of Information Retrieval Metrics, 2005, NTCIR.
[3] Zhoujun Li, et al. Sequential Match Network: A New Architecture for Multi-turn Response Selection in Retrieval-based Chatbots, 2016, ArXiv.
[4] Joelle Pineau, et al. The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems, 2015, SIGDIAL.
[5] Mirella Lapata, et al. Vector-based Models of Semantic Composition, 2008, ACL.
[6] Chin-Yew Lin, et al. ROUGE: A Package for Automatic Evaluation of Summaries, 2004, ACL.
[7] Salim Roukos, et al. Bleu: a Method for Automatic Evaluation of Machine Translation, 2002, ACL.
[8] Peter W. Foltz, et al. The Measurement of Textual Coherence with Latent Semantic Analysis, 1998.
[9] Gianni Amati, et al. Frequentist and Bayesian Approach to Information Retrieval, 2006, ECIR.
[10] Alistair Moffat, et al. Rank-biased precision for measurement of retrieval effectiveness, 2008, TOIS.
[11] T. Landauer, et al. A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge, 1997.
[12] Éric Gaussier, et al. Information-based models for ad hoc IR, 2010, SIGIR.
[13] Filip Radlinski, et al. Comparing the sensitivity of information retrieval metrics, 2010, SIGIR.
[14] Tetsuya Sakai. Evaluation with informational and navigational intents, 2012, WWW.
[15] C. J. van Rijsbergen, et al. Probabilistic models of information retrieval based on measuring the divergence from randomness, 2002, TOIS.
[16] Dongyan Zhao, et al. How to Make Context More Useful? An Empirical Study on Context-Aware Neural Conversational Models, 2017, ACL.
[17] Tetsuya Sakai, et al. Evaluating evaluation metrics based on the bootstrap, 2006, SIGIR.
[18] Jason Weston, et al. Wizard of Wikipedia: Knowledge-Powered Conversational Agents, 2018, ICLR.
[19] Jian Zhang, et al. SQuAD: 100,000+ Questions for Machine Comprehension of Text, 2016, EMNLP.
[20] Xiang Li, et al. Two are Better than One: An Ensemble of Retrieval- and Generation-Based Dialog Systems, 2016, ArXiv.
[21] Zhaochun Ren, et al. Explicit State Tracking with Semi-Supervision for Neural Dialogue Generation, 2018, CIKM.
[22] Joelle Pineau, et al. Bootstrapping Dialog Systems with Word Embeddings, 2014.
[23] Jianfeng Gao, et al. deltaBLEU: A Discriminative Metric for Generation Tasks with Intrinsically Diverse Targets, 2015, ACL.
[24] Yi Yang, et al. WikiQA: A Challenge Dataset for Open-Domain Question Answering, 2015, EMNLP.
[25] Hideo Joho, et al. Conversational Search (Dagstuhl Seminar 19461), 2019, Dagstuhl Reports.
[26] Alan Ritter, et al. Data-Driven Response Generation in Social Media, 2011, EMNLP.
[27] Paul A. Crook, et al. Measuring User Satisfaction on Smart Speaker Intelligent Assistants Using Intent Sensitive Query Embeddings, 2018, CIKM.
[28] Nobuhiro Kaji, et al. Prediction of Prospective User Engagement with Intelligent Assistants, 2016, ACL.
[29] Stephen E. Robertson, et al. Understanding inverse document frequency: on theoretical arguments for IDF, 2004, J. Documentation.
[30] Alon Lavie, et al. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments, 2005, IEEvaluation@ACL.
[31] Quoc V. Le, et al. Sequence to Sequence Learning with Neural Networks, 2014, NIPS.
[32] Éric Gaussier, et al. Bridging Language Modeling and Divergence from Randomness Models: A Log-Logistic Model for IR, 2009, ICTIR.
[33] Ming-Wei Chang, et al. A Knowledge-Grounded Neural Conversation Model, 2017, AAAI.
[34] Chunyu Kit, et al. Comparative Evaluation of Term Informativeness Measures in Machine Translation Evaluation Metrics, 2011, MTSUMMIT.
[35] Imed Zitouni, et al. Understanding User Satisfaction with Intelligent Assistants, 2016, CHIIR.
[36] Giorgio Gambosi, et al. FUB, IASI-CNR and University of Tor Vergata at TREC 2008 Blog Track, 2008, TREC.
[37] Ben Carterette, et al. Multiple testing in statistical analysis of systems-based information retrieval experiments, 2012, TOIS.
[38] Stefan Ultes, et al. MultiWOZ - A Large-Scale Multi-Domain Wizard-of-Oz Dataset for Task-Oriented Dialogue Modelling, 2018, EMNLP.
[39] Lidia S. Chao, et al. LEPOR: A Robust Evaluation Metric for Machine Translation with Augmented Factors, 2012, COLING.
[40] Yiqun Liu, et al. Towards Designing Better Session Search Evaluation Metrics, 2018, SIGIR.
[41] Grigori Sidorov, et al. Soft Similarity and Soft Cosine Measure: Similarity of Features in Vector Space Model, 2014, Computación y Sistemas.
[42] Anoop Cherian, et al. The Eighth Dialog System Technology Challenge, 2019, ArXiv.
[43] Zhe Gan, et al. Generating Informative and Diverse Conversational Responses via Adversarial Information Maximization, 2018, NeurIPS.
[44] Mark Sanderson, et al. Informing the Design of Spoken Conversational Search: Perspective Paper, 2018, CHIIR.
[45] Tetsuya Sakai. Metrics, Statistics, Tests, 2013, PROMISE Winter School.
[46] Jean Carletta, et al. Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005, ACL.
[47] Ali Ahmadvand, et al. Offline and Online Satisfaction Prediction in Open-Domain Conversational Systems, 2019, CIKM.
[48] Wei Wei, et al. PONE: A Novel Automatic Evaluation Metric for Open-Domain Generative Dialogue Systems, 2020, ACM Trans. Inf. Syst.
[49] Joelle Pineau, et al. On the Evaluation of Dialogue Systems with Next Utterance Classification, 2016, SIGDIAL.
[50] Joseph Olive, et al. Handbook of Natural Language Processing and Machine Translation: DARPA Global Autonomous Language Exploitation, 2011.
[51] Joelle Pineau, et al. Learning an Unreferenced Metric for Online Dialogue Evaluation, 2020, ACL.
[52] Dongyan Zhao, et al. An Ensemble of Retrieval-Based and Generation-Based Human-Computer Conversation Systems, 2018, IJCAI.
[53] James Allan, et al. Correlation Between System and User Metrics in a Session, 2016, CHIIR.
[54] Diane Kelly, et al. Methods for Evaluating Interactive Information Retrieval Systems with Users, 2009, Found. Trends Inf. Retr.
[55] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.
[56] Djoerd Hiemstra, et al. Using language models for information retrieval, 2001.
[57] Xiaoyu Shen, et al. DailyDialog: A Manually Labelled Multi-turn Dialogue Dataset, 2017, IJCNLP.
[58] Dongyan Zhao, et al. RUBER: An Unsupervised Method for Automatic Evaluation of Open-Domain Dialog Systems, 2017, AAAI.
[59] Joelle Pineau, et al. A Hierarchical Latent Variable Encoder-Decoder Model for Generating Dialogues, 2016, AAAI.
[60] Jianfeng Gao, et al. Multi-Task Learning for Speaker-Role Adaptation in Neural Conversation Models, 2017, IJCNLP.
[61] Filip Radlinski, et al. A Theoretical Framework for Conversational Search, 2017, CHIIR.
[62] Jun Huang, et al. Response Ranking with Deep Matching Networks and External Knowledge in Information-seeking Conversation Systems, 2018, SIGIR.
[63] Wen Zheng, et al. Enhancing Conversational Dialogue Models with Grounded Knowledge, 2019, CIKM.
[64] Joelle Pineau, et al. How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation, 2016, EMNLP.
[65] Matthew Marge, et al. Evaluating Evaluation Methods for Generation in the Presence of Variation, 2005, CICLing.
[66] Wei Wei, et al. When to Talk: Chatbot Controls the Timing of Talking during Multi-turn Open-domain Dialogue Generation, 2019, ArXiv.
[67] Maxine Eskenazi, et al. USR: An Unsupervised and Reference Free Evaluation Metric for Dialog Generation, 2020, ACL.
[68] Falk Scholer, et al. User performance versus precision measures for simple search tasks, 2006, SIGIR.
[69] Jaana Kekäläinen, et al. Cumulated gain-based evaluation of IR techniques, 2002, TOIS.
[70] Joelle Pineau, et al. Towards an Automatic Turing Test: Learning to Evaluate Dialogue Responses, 2017, ACL.
[71] Katharina Kann, et al. Sentence-Level Fluency Evaluation: References Help, But Can Be Spared!, 2018, CoNLL.
[72] Verena Rieser, et al. Why We Need New Evaluation Metrics for NLG, 2017, EMNLP.
[73] Imed Zitouni, et al. Predicting User Satisfaction with Intelligent Assistants, 2016, SIGIR.
[74] W. Bruce Croft, et al. WikiPassageQA: A Benchmark Collection for Research on Non-factoid Answer Passage Retrieval, 2018, SIGIR.
[75] Delphine Charlet, et al. SimBow at SemEval-2017 Task 3: Soft-Cosine Semantic Similarity between Questions for Community Question Answering, 2017, SemEval.
[76] Olivier Chapelle, et al. Expected reciprocal rank for graded relevance, 2009, CIKM.
[77] Yiqun Liu, et al. Meta-evaluation of Online and Offline Web Search Evaluation Metrics, 2017, SIGIR.
[78] Tetsuya Sakai, et al. On the reliability and intuitiveness of aggregated search metrics, 2013, CIKM.
[79] Mert Kilickaya, et al. Re-evaluating Automatic Metrics for Image Captioning, 2016, EACL.
[80] Xuan Liu, et al. Multi-view Response Selection for Human-Computer Conversation, 2016, EMNLP.
[81] Tetsuya Sakai, et al. Which Diversity Evaluation Measures Are "Good"?, 2019, SIGIR.
[82] Mark Sanderson, et al. Do user preferences and evaluation measures line up?, 2010, SIGIR.
[83] Arantxa Otegi, et al. Survey on evaluation methods for dialogue systems, 2019, Artificial Intelligence Review.
[84] Alan Ritter, et al. Unsupervised Modeling of Twitter Conversations, 2010, NAACL.
[85] Vasile Rus, et al. A Comparison of Greedy and Optimal Assessment of Natural Language Student Input Using Word-to-Word Similarity Metrics, 2012, BEA@NAACL-HLT.
[86] Ben Carterette, et al. Evaluating multi-query sessions, 2011, SIGIR.
[87] Stephen P. Harter, et al. A probabilistic approach to automatic keyword indexing. Part I. On the Distribution of Specialty Words in a Technical Literature, 1975, J. Am. Soc. Inf. Sci.
[88] Joemon M. Jose, et al. Evaluating aggregated search pages, 2012, SIGIR.
[89] Louise T. Su. Evaluation Measures for Interactive Information Retrieval, 1992, Inf. Process. Manag.
[90] Ryen W. White, et al. Understanding and Predicting Graded Search Satisfaction, 2015, WSDM.
[91] Jianfeng Gao, et al. A Neural Network Approach to Context-Sensitive Generation of Conversational Responses, 2015, NAACL.
[92] W. Bruce Croft, et al. Analyzing and Characterizing User Intent in Information-seeking Conversations, 2018, SIGIR.
[93] Matthijs Douze, et al. FastText.zip: Compressing text classification models, 2016, ArXiv.
[94] Mikhail Burtsev, et al. Goal-Oriented Multi-Task BERT-Based Dialogue State Tracker, 2020, ArXiv.
[95] Ben Carterette, et al. From a User Model for Query Sessions to Session Rank Biased Precision (sRBP), 2019, ICTIR.
[96] Tetsuya Sakai, et al. On the reliability of information retrieval metrics based on graded relevance, 2007, Inf. Process. Manag.
[97] Emmanuel Morin, et al. Deep Retrieval-Based Dialogue Systems: A Short Review, 2019, ArXiv.
[98] Nanyun Peng, et al. Better Automatic Evaluation of Open-Domain Dialogue Systems with Contextualized Embeddings, 2019, Workshop on Methods for Optimizing and Evaluating Neural Language Generation.
[99] Noriko Kando, et al. On information retrieval metrics designed for evaluation with incomplete relevance assessments, 2008, Information Retrieval.