A comparison of statistical significance tests for information retrieval evaluation

Information retrieval (IR) researchers commonly use three tests of statistical significance: Student's paired t-test, the Wilcoxon signed rank test, and the sign test. Other researchers have previously proposed using both the bootstrap and Fisher's randomization (permutation) test as non-parametric significance tests for IR, but these tests have seen little use. For each of these five tests, we took the ad-hoc retrieval runs submitted to TRECs 3 and 5-8, and for each pair of runs, we measured the statistical significance of the difference in their mean average precision. We discovered that there is little practical difference between the randomization, bootstrap, and t tests. Both the Wilcoxon and sign tests have a poor ability to detect significance and have the potential to lead to false detections of significance. The Wilcoxon and sign tests are simplified variants of the randomization test, and their use should be discontinued for measuring the significance of a difference between means.
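To make the per-pair comparison concrete, the following is a minimal sketch of a two-sided paired randomization test on the difference in mean average precision. It is an illustration under stated assumptions, not the implementation used for the experiments: the function name, the `scores_a` and `scores_b` inputs (per-topic average precision for two runs over the same topics), the `n_samples` default, and the fixed `seed` are all hypothetical choices, and the test is approximated by random sampling of sign assignments rather than by enumerating every permutation.

```python
import random


def paired_randomization_test(scores_a, scores_b, n_samples=10_000, seed=0):
    """Approximate two-sided randomization test on a difference in means.

    scores_a, scores_b: per-topic average precision for two runs, aligned
    by topic (hypothetical inputs for illustration).
    Returns the estimated p-value.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both runs must be scored on the same topics")
    if not scores_a:
        raise ValueError("at least one topic is required")

    rng = random.Random(seed)  # fixed seed is an assumption, for repeatability
    n = len(scores_a)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]

    # Observed test statistic: absolute difference in mean average precision.
    observed = abs(sum(diffs) / n)

    extreme = 0
    for _ in range(n_samples):
        # Under the null hypothesis the two run labels are exchangeable on
        # each topic, so each per-topic difference may take either sign.
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs) / n
        if abs(permuted) >= observed:
            extreme += 1

    return extreme / n_samples


if __name__ == "__main__":
    # Made-up per-topic scores, only to show how the function is called.
    run_a = [0.42, 0.31, 0.55, 0.18, 0.60, 0.27, 0.49, 0.36]
    run_b = [0.38, 0.29, 0.41, 0.20, 0.52, 0.25, 0.40, 0.30]
    print(paired_randomization_test(run_a, run_b))
```

Flipping the sign of each per-topic difference is equivalent to swapping the two runs' scores on that topic, which is why the sketch only needs the list of differences. The estimated p-value depends on the number of samples drawn and on the seed.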
