Bayesian Inferential Risk Evaluation On Multiple IR Systems

Information retrieval (IR) ranking models in production systems continually evolve in response to user feedback, insights from research, and new developments. Rather than investing all engineering resources in producing a single challenger to the existing system, a commercial provider might choose to explore multiple new ranking models simultaneously. However, even small changes to a complex model can have unintended consequences. In particular, the per-topic effectiveness profile is likely to change, and even when an overall improvement is achieved, gains are rarely observed for every query, introducing the risk that some users or queries will be negatively impacted if the new model is deployed into production. Risk adjustments that re-weight losses relative to gains and mitigate such behavior are available when making one-to-one system comparisons, but not for one-to-many or many-to-one comparisons. Moreover, no IR evaluation methodology integrates priors from previous or alternative rankers in a homogeneous inferential framework. In this work, we propose a Bayesian approach in which multiple challengers are compared to a single champion. We also show that risk can be incorporated into this framework, and demonstrate the benefits of doing so. Finally, we consider the alternative scenario commonly encountered in academic research, in which a single challenger is compared against several previous champions.
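
The re-weighting of losses relative to gains mentioned above can be made concrete with a URisk-style measure, in the spirit of the risk-sensitive optimization of Wang et al. (SIGIR 2012). The sketch below is a minimal illustration rather than the paper's implementation: the function name risk_adjusted_delta, the choice alpha = 2, and the synthetic per-topic scores are all assumptions introduced here.

```python
import numpy as np

def risk_adjusted_delta(champion, challenger, alpha=1.0):
    """Per-topic score differences with losses re-weighted by
    (1 + alpha) relative to gains, URisk-style: a topic where the
    challenger loses counts (1 + alpha) times as heavily as a win."""
    delta = np.asarray(challenger) - np.asarray(champion)
    return np.where(delta < 0.0, (1.0 + alpha) * delta, delta)

# Hypothetical per-topic effectiveness scores (e.g., AP over 50 topics)
# for one champion and two candidate challengers.
rng = np.random.default_rng(0)
champion = rng.uniform(0.2, 0.6, size=50)
challengers = {
    "challenger_a": np.clip(champion + rng.normal(0.02, 0.05, 50), 0, 1),
    "challenger_b": np.clip(champion + rng.normal(0.00, 0.10, 50), 0, 1),
}

for name, scores in challengers.items():
    raw = np.mean(scores - champion)
    adjusted = np.mean(risk_adjusted_delta(champion, scores, alpha=2.0))
    print(f"{name}: mean delta = {raw:+.4f}, "
          f"risk-adjusted (alpha=2) = {adjusted:+.4f}")
```

Averaging the adjusted per-topic differences recovers the usual one-to-one risk score; the contribution of this work is to move that style of loss re-weighting into a Bayesian inferential model, so that one-to-many and many-to-one comparisons, as well as priors from previous or alternative rankers, can be handled within a single framework.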
