Statistical Significance Testing in Information Retrieval: An Empirical Analysis of Type I, Type II and Type III Errors

Statistical significance testing is widely accepted as a means to assess how well a difference in effectiveness reflects an actual difference between systems, as opposed to random noise arising from the selection of topics. According to recent surveys of SIGIR, CIKM, ECIR and TOIS papers, the t-test is the most popular choice among IR researchers. However, previous work has suggested computer-intensive tests like the bootstrap or the permutation test, based mainly on theoretical arguments. On empirical grounds, others have suggested non-parametric alternatives such as the Wilcoxon test. Indeed, the question of which test we should use has accompanied IR and related fields for decades now. Previous theoretical studies on this matter were limited in that we know the test assumptions are not met in IR experiments, and empirical studies were limited in that we do not have the necessary control over the null hypotheses to compute actual Type I and Type II error rates under realistic conditions. Therefore, not only is it unclear which test to use, but also how much trust we should put in them. In contrast to past studies, in this paper we employ a recent simulation methodology based on TREC data to circumvent these limitations. Our study comprises over 500 million p-values computed for a range of tests, systems, effectiveness measures, topic set sizes and effect sizes, and for both the 2-tailed and 1-tailed cases. Having such a large supply of IR evaluation data with full knowledge of the null hypotheses, we are finally in a position to evaluate how well statistical significance tests really behave with IR data, and to make sound recommendations for practitioners.
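To make the setting concrete, the paired permutation (randomization) test mentioned above can be sketched in a few lines. This is a minimal stdlib-only illustration, not the paper's implementation: the function name, the number of resamples, and the example per-topic scores are all assumptions for illustration. Under the null hypothesis the two systems are exchangeable, so the sign of each per-topic score difference can be flipped at random to build the reference distribution of the mean difference.

```python
import random

def permutation_test(scores_a, scores_b, trials=10000, seed=7):
    """Two-sided paired permutation test on per-topic effectiveness scores.

    Under H0 the per-topic differences are exchangeable in sign, so we
    randomly flip signs and count how often the permuted mean difference
    is at least as extreme as the observed one.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))
    rng = random.Random(seed)  # fixed seed for reproducibility
    extreme = 0
    for _ in range(trials):
        # Flip the sign of each topic's difference with probability 1/2.
        permuted = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(permuted) / len(permuted)) >= observed:
            extreme += 1
    # Add-one smoothing keeps the p-value strictly positive.
    return (extreme + 1) / (trials + 1)
```

For example, with hypothetical per-topic AP scores where system A consistently beats system B by 0.1, `permutation_test(a, b)` returns a small p-value, while identical score lists yield p = 1. For a 2-tailed test the absolute mean difference is compared; dropping the `abs()` on the permuted mean gives the 1-tailed variant.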
