Testing the tests: simulation of rankings to compare statistical significance tests in information retrieval evaluation

Null Hypothesis Significance Testing (NHST) is the standard framework for assessing differences in performance between Information Retrieval (IR) systems. IR practitioners customarily apply significance tests such as the t-test, the Wilcoxon Signed Rank test, the Permutation test, the Sign test, or the Bootstrap test. However, which of these tests is the most reliable in IR experimentation remains controversial: different authors have tried to shed light on this issue, but their conclusions do not agree. In this paper, we present a new methodology for assessing the behavior of significance tests in typical ranking tasks. Our method creates models from the search systems and uses those models to simulate different inputs to the significance tests. With this approach, we can control the experimental conditions and run experiments with full knowledge of whether the null hypothesis is true or false. Following our methodology, we ran a series of simulations that estimate the proportion of Type I and Type II errors made by the different tests. The results strongly suggest that the Wilcoxon test is the most reliable test and that IR practitioners should therefore adopt it as the reference tool for assessing differences between IR systems.
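The general idea of estimating error rates by simulation can be illustrated with a minimal sketch. It does not reproduce the models built from search systems in the paper. As a loudly stated assumption, per-topic scores are drawn from a Beta(2, 5) distribution, and a constant `shift` is added to one system to make the null hypothesis false. All function names and parameters here are hypothetical, and only two of the tests named above (the Sign test and a paired Permutation test) are implemented, using the Python standard library.

```python
import math
import random


def sign_test_p(diffs, rng=None):
    """Two-sided exact sign test on paired differences; ties (zeros) are dropped."""
    pos = sum(d > 0 for d in diffs)
    neg = sum(d < 0 for d in diffs)
    n = pos + neg
    if n == 0:
        return 1.0
    k = min(pos, neg)
    # Binomial(n, 0.5) probability of seeing k or fewer of the rarer sign.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def permutation_test_p(diffs, rng, n_perm=200):
    """Two-sided paired randomization test: randomly flip the sign of each difference."""
    observed = abs(sum(diffs))
    hits = 0
    for _ in range(n_perm):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) >= observed:
            hits += 1
    # Add-one correction keeps the Monte Carlo p-value strictly positive.
    return (hits + 1) / (n_perm + 1)


def simulate_rejection_rate(test, shift, n_topics=50, n_trials=200,
                            alpha=0.05, seed=0):
    """Fraction of simulated experiments in which `test` rejects H0.

    With shift == 0 both systems share one score distribution, so H0 is true
    and the returned rate estimates the Type I error rate. With shift > 0,
    H0 is false and (1 - rate) estimates the Type II error rate.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_trials):
        # Assumed score model (not the paper's): Beta(2, 5) per-topic scores.
        a = [rng.betavariate(2, 5) for _ in range(n_topics)]
        b = [min(1.0, rng.betavariate(2, 5) + shift) for _ in range(n_topics)]
        diffs = [y - x for x, y in zip(a, b)]
        if test(diffs, rng) <= alpha:
            rejections += 1
    return rejections / n_trials


if __name__ == "__main__":
    for name, test in [("sign", sign_test_p), ("permutation", permutation_test_p)]:
        type1 = simulate_rejection_rate(test, shift=0.0)
        type2 = 1.0 - simulate_rejection_rate(test, shift=0.1)
        print(f"{name}: Type I ~ {type1:.3f}, Type II ~ {type2:.3f}")
```

Because the simulation fixes whether the null hypothesis holds, every rejection can be classified as correct or as an error, which is what makes the comparison between tests possible. The numbers this sketch prints depend entirely on the assumed score model and say nothing about the paper's findings.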
