Taking Risks with Confidence

Risk-based evaluation is a failure analysis tool that can be combined with traditional effectiveness metrics to ensure that the improvements observed are consistent across topics when comparing systems. Here we explore the stability of confidence intervals in inference-based risk measurement, extending previous work to five different commonly used inference testing techniques. Using the Robust04 and TREC Core 2017 NYT corpora, we show that risk inferences using parametric methods appear to disagree with their non-parametric counterparts, warranting further investigation. Additionally, we explore how the number of topics being evaluated affects confidence interval stability, and find that more than 50 topics appear to be required before risk-sensitive comparison results are consistent across different inference testing frameworks.
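To make the idea of a confidence interval on a risk measure concrete, the following is a minimal sketch and not the procedure used in the paper. It assumes the URisk formulation common in the risk-sensitive evaluation literature, where per-topic losses against a baseline are weighted by (1 + alpha), and it assumes a percentile bootstrap over topics as one possible inference technique; the abstract does not name the five techniques it compares. The function names, parameters, and scores below are hypothetical.

```python
import random

def urisk(run, baseline, alpha=1.0):
    """Mean risk-adjusted per-topic difference (assumed URisk form).

    Wins count as-is; losses are scaled by (1 + alpha).
    """
    deltas = [r - b for r, b in zip(run, baseline)]
    adjusted = [d if d >= 0 else (1 + alpha) * d for d in deltas]
    return sum(adjusted) / len(adjusted)

def bootstrap_ci(run, baseline, alpha=1.0, n_resamples=2000,
                 level=0.95, seed=0):
    """Percentile bootstrap interval for URisk, resampling topics with replacement."""
    rng = random.Random(seed)
    n = len(run)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(urisk([run[i] for i in idx],
                           [baseline[i] for i in idx], alpha))
    stats.sort()
    tail = (1 - level) / 2
    lo_idx = round(tail * n_resamples)
    hi_idx = round((1 - tail) * n_resamples) - 1
    return stats[lo_idx], stats[hi_idx]

# Hypothetical per-topic effectiveness scores for two systems.
baseline_scores = [0.40, 0.35, 0.50, 0.20, 0.45, 0.30, 0.55, 0.25]
run_scores = [0.50, 0.30, 0.55, 0.35, 0.40, 0.45, 0.60, 0.20]

point = urisk(run_scores, baseline_scores, alpha=1.0)
low, high = bootstrap_ci(run_scores, baseline_scores, alpha=1.0)
print(f"URisk = {point:.4f}, 95% CI = [{low:.4f}, {high:.4f}]")
```

With only a handful of topics the interval is wide and shifts noticeably between resampling seeds, which is the kind of instability the abstract attributes to small topic sets.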
