On the Calibration and Uncertainty of Neural Learning to Rank Models for Conversational Search

According to the Probability Ranking Principle (PRP), ranking documents in decreasing order of their probability of relevance leads to an optimal document ranking for ad-hoc retrieval. The PRP holds when two conditions are met: [C1] the models are well calibrated, and [C2] the probabilities of relevance are reported with certainty. We know, however, that deep neural networks (DNNs) are often not well calibrated and have several sources of uncertainty, and thus [C1] and [C2] might not be satisfied by neural rankers. Given the success of neural Learning to Rank (L2R) approaches—and here, especially BERT-based approaches—we first analyze under which circumstances deterministic neural rankers, i.e. rankers that output point estimates, are calibrated. Then, motivated by our findings, we use two techniques to model the uncertainty of neural rankers, leading to the proposed stochastic rankers, which output a predictive distribution of relevance as opposed to point estimates. Our experimental results on the ad-hoc retrieval task of conversation response ranking reveal that (i) BERT-based rankers are not robustly calibrated and that stochastic BERT-based rankers yield better calibration; and (ii) uncertainty estimation is beneficial for both risk-aware neural ranking, i.e. taking into account the uncertainty when ranking documents, and for predicting unanswerable conversational contexts.
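The two ingredients of the abstract, measuring calibration ([C1]) and replacing point estimates with a predictive distribution, can be made concrete. Below is a minimal sketch, not the authors' exact setup: it assumes a HuggingFace-style BERT cross-encoder, and the model name, the number of samples, and the risk weight `b` are illustrative assumptions. The first function computes the Expected Calibration Error (ECE), a standard calibration measure; the second uses MC Dropout (dropout kept active at inference, one common technique for stochastic predictions) to produce a predictive distribution of relevance and a risk-aware score that penalizes uncertain candidates, in the spirit of mean-variance ranking.

```python
# Minimal sketch: calibration (ECE) and a stochastic BERT ranker via MC Dropout.
# Model name, sample count, and risk weight b are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by confidence, then average (weighted by bin size)
    the gap between each bin's mean confidence and its empirical accuracy."""
    confidences = torch.as_tensor(confidences, dtype=torch.float)
    correct = torch.as_tensor(correct, dtype=torch.float)
    edges = torch.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = (correct[in_bin].mean() - confidences[in_bin].mean()).abs()
            ece += in_bin.float().mean() * gap  # weight by fraction in bin
    return float(ece)


def mc_dropout_relevance(context, response, n_samples=10, b=1.0):
    """Predictive distribution of relevance for one (context, response) pair.
    Returns (mean, std, risk_aware_score) where the score is mean - b * std."""
    inputs = tokenizer(context, response, return_tensors="pt",
                       truncation=True, max_length=512)
    model.train()  # keep dropout layers active at inference time (MC Dropout)
    samples = []
    with torch.no_grad():
        for _ in range(n_samples):
            logits = model(**inputs).logits
            # probability of the "relevant" class as the relevance estimate
            samples.append(torch.softmax(logits, dim=-1)[0, 1].item())
    samples = torch.tensor(samples)
    mean, std = samples.mean().item(), samples.std().item()
    return mean, std, mean - b * std  # penalize uncertain candidates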
