How do Metric Score Distributions affect the Type I Error Rate of Statistical Significance Tests in Information Retrieval?

Statistical significance tests are the main tool that IR practitioners use to assess the reliability of their experimental evaluation results. The question of which test behaves best with IR evaluation data has been open for decades and has produced all kinds of results and recommendations. A definitive answer has recently been sought through stochastic simulation of IR evaluation data, which lets researchers compute actual Type I error rates because they control the null hypothesis. One such line of research simulates metric scores for a fixed set of systems on random topics, and concluded that the t-test behaves best. Another line simulates retrieval runs by random systems on a fixed set of topics, and concluded that the Wilcoxon test behaves best. Interestingly, two recent surveys of the IR literature show that the community has a clear preference for precisely these two tests, so it is critical to understand why the above simulation studies reach opposite conclusions. It has recently been postulated that one reason for the disagreement is the distributions of metric scores used by one of these simulation methods. In this paper we investigate this issue and extend the argument to another key aspect of the simulation, namely the dependence between systems. Following a principled approach, we analyze the robustness of statistical tests to these factors, identifying the conditions under which they do or do not behave well with respect to the Type I error rate. Our results suggest that the differences between the Wilcoxon test and the t-test may be explained by the skewness of score differences.
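To make the simulation logic concrete, below is a minimal sketch of how such a Type I error experiment can be run. It is an illustration under assumed conditions, not the paper's actual generative model: per-topic score differences between two equally effective systems are drawn from a zero-mean skew-normal distribution, and the shape parameter skew_a, the topic count n_topics, and the trial count n_trials are all illustrative choices. SciPy's standard paired-difference tests are applied to each simulated evaluation.

    import numpy as np
    from scipy import stats

    # Illustrative sketch (assumed generative model, not the paper's):
    # estimate Type I error rates of the one-sample t-test and the
    # Wilcoxon signed-rank test on simulated per-topic score differences.
    rng = np.random.default_rng(42)
    n_topics = 50       # topics per simulated evaluation (assumption)
    n_trials = 10_000   # number of simulated evaluations (assumption)
    alpha = 0.05        # nominal significance level
    skew_a = 4.0        # skew-normal shape; 0 would give symmetric differences

    # Center the skew-normal at zero so the null hypothesis of a zero
    # mean difference truly holds in every trial.
    skew_mean = stats.skewnorm.mean(skew_a)

    t_rejections = 0
    w_rejections = 0
    for _ in range(n_trials):
        diffs = stats.skewnorm.rvs(skew_a, size=n_topics, random_state=rng) - skew_mean
        if stats.ttest_1samp(diffs, 0.0).pvalue < alpha:
            t_rejections += 1
        if stats.wilcoxon(diffs).pvalue < alpha:
            w_rejections += 1

    print(f"t-test Type I error rate:   {t_rejections / n_trials:.4f}")
    print(f"Wilcoxon Type I error rate: {w_rejections / n_trials:.4f}")

One detail worth noting in this sketch: when the differences are skewed, centering them at zero satisfies the t-test's null hypothesis (zero mean) but not the Wilcoxon test's (symmetry about zero), so the two tests are effectively answering slightly different questions. This mismatch is one plausible mechanism by which skewed score differences could drive the two tests to opposite conclusions.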
