Deciding on an adjustment for multiplicity in IR experiments

We evaluate statistical inference procedures for small-scale IR experiments that involve multiple comparisons against a baseline. These procedures adjust for multiplicity by ensuring that the probability of observing at least one false positive in the experiment (the family-wise error rate) stays below a given threshold. We use only publicly available test collections and make our software available for download; in particular, we employ TREC runs and runs constructed from the Microsoft learning-to-rank (MSLR) data set. Our focus is on non-parametric procedures: the Holm-Bonferroni adjustment of permutation-test p-values, the MaxT permutation test, and permutation-based closed testing. In TREC-based simulations, these procedures retain 66% to 92% of individually significant results (i.e., those obtained without taking other comparisons into account); similar retention rates are observed in the MSLR simulations. For the largest evaluated query set size (6400 queries), procedures that adjust for multiplicity find at most 5% fewer true differences than unadjusted tests, while unadjusted tests produce many more false positives.
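The first of the adjustments mentioned above, Holm-Bonferroni, is a simple step-down procedure that can be applied to p-values from any underlying test (including the permutation test used here). As a minimal illustrative sketch, not the paper's actual implementation, the following function computes Holm-adjusted p-values: p-values are sorted in ascending order, the i-th smallest is multiplied by (m − i + 1), and a running maximum enforces monotonicity so that rejecting any hypothesis implies rejecting all those with smaller raw p-values.

```python
def holm_bonferroni(pvals, alpha=0.05):
    """Holm-Bonferroni step-down adjustment controlling the
    family-wise error rate (FWER) at level alpha.

    Returns (adjusted p-values, rejection decisions), both in the
    original order of `pvals`.
    """
    m = len(pvals)
    # Indices of p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The rank-th smallest p-value is scaled by (m - rank),
        # i.e. by m for the smallest, m - 1 for the next, etc.
        adj = min(1.0, (m - rank) * pvals[i])
        # Enforce monotonicity of the adjusted p-values.
        running_max = max(running_max, adj)
        adjusted[i] = running_max
    rejected = [p <= alpha for p in adjusted]
    return adjusted, rejected
```

For example, with raw p-values `[0.01, 0.04, 0.03, 0.005]` only the first and last survive adjustment at alpha = 0.05: the smallest (0.005) is scaled by 4 to 0.02, while 0.03 is scaled by 2 to 0.06 and blocks 0.04 from being rejected.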
