Statistical Significance Testing in Information Retrieval: Theory and Practice

The past 20 years have seen a marked improvement in the rigor of information retrieval experimentation, due primarily to two factors: high-quality, public, portable test collections such as those produced by TREC (the Text REtrieval Conference) [16], and the increased practice of statistical hypothesis testing to determine whether measured improvements can be ascribed to something other than random chance. Together these create a very useful standard for reviewers, program committees, and journal editors: work in information retrieval (IR) increasingly cannot be published unless it has been evaluated using a well-constructed test collection and shown to produce a statistically significant improvement over a good baseline. But, as the saying goes, any tool sharp enough to be useful is also sharp enough to be dangerous. Statistical tests of significance are widely misunderstood; most researchers treat them as a "black box": evaluation results go in and a p-value comes out. Because significance is such an important factor in determining which research directions are explored and what gets published, p-values obtained without thought can have consequences for everyone doing research in IR. Ioannidis has argued that in the biomedical sciences the main consequence is that most published research findings are false [9]; could that be the case in IR as well?
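In IR practice, the hypothesis tests discussed here are typically paired tests over per-topic effectiveness scores (e.g., average precision) of two systems on the same topics. As a concrete illustration, the following is a minimal sketch of a two-sided paired randomization (permutation) test, one of the test families compared in [27]; the function name and the scores used below are illustrative, not taken from the paper:

```python
import random

def paired_permutation_test(scores_a, scores_b, trials=10000, seed=0):
    """Two-sided paired randomization test.

    scores_a, scores_b: per-topic effectiveness scores (e.g., AP) for two
    systems on the same topics, in the same topic order.
    Under the null hypothesis the systems are exchangeable, so each
    per-topic difference is equally likely to have either sign. The p-value
    is the fraction of random sign assignments whose mean absolute
    difference is at least as extreme as the observed one.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    rng = random.Random(seed)  # seeded for reproducibility
    extreme = 0
    for _ in range(trials):
        # Randomly flip the sign of each per-topic difference.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs) / n
        if abs(flipped) >= observed:
            extreme += 1
    return extreme / trials
```

For example, with ten topics on which system A consistently scores 0.1 higher than system B, the test reports a small p-value, while comparing a system against itself yields p = 1.0. With only a handful of topics the test cannot reach conventional significance levels at all, which is one reason topic set size matters [8].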

[1] J. Berger. Could Fisher, Jeffreys and Neyman Have Agreed on Testing?, 2003.

[2] Ben Carterette et al. Hypothesis testing with incomplete relevance judgments. CIKM '07, 2007.

[3] Regina Nuzzo et al. Scientific method: Statistical errors. Nature, 2014.

[4] Ellen M. Voorhees et al. Variations in relevance judgments and the measurement of retrieval effectiveness. SIGIR '98, 1998.

[5] Jean Tague-Sutcliffe et al. The Pragmatics of Information Retrieval Experimentation Revisited. Inf. Process. Manag., 1997.

[6] Alistair Moffat et al. Statistical power in retrieval experimentation. CIKM '08, 2008.

[7] Mark Sanderson et al. Test Collection Based Evaluation of Information Retrieval Systems. Found. Trends Inf. Retr., 2010.

[8] Ellen M. Voorhees et al. The effect of topic set size on retrieval experiment error. SIGIR '02, 2002.

[9] J. Ioannidis. Why Most Published Research Findings Are False. PLoS Medicine, 2005.

[10] Leonid Boytsov et al. Deciding on an adjustment for multiplicity in IR experiments. SIGIR, 2013.

[11] Justin Zobel et al. How reliable are the results of large-scale information retrieval experiments? SIGIR '98, 1998.

[12] Alistair Moffat et al. What Does It Mean to "Measure Performance"? WISE, 2004.

[13] Mark Sanderson et al. Information retrieval system evaluation: effort, sensitivity, and reliability. SIGIR '05, 2005.

[14] Ben Carterette et al. Reusable test collections through experimental design. SIGIR, 2010.

[15] Gobinda G. Chowdhury et al. TREC: Experiment and Evaluation in Information Retrieval, 2007.

[16] Ellen M. Voorhees et al. TREC: Experiment and Evaluation in Information Retrieval (Digital Libraries and Electronic Publishing), 2005.

[17] Ben Carterette. Model-Based Inference about IR Systems. ICTIR, 2011.

[18] M. Artés. Statistical errors. Medicina Clinica, 1977.

[19] Alistair Moffat et al. Improvements that don't add up: ad-hoc retrieval results since 1998. CIKM, 2009.

[20] Ben Carterette et al. Simulating simple user behavior for system effectiveness evaluation. CIKM '11, 2011.

[21] Ben Carterette et al. Multiple testing in statistical analysis of systems-based information retrieval experiments. TOIS, 2012.

[22] James Blustein et al. A Statistical Analysis of the TREC-3 Data. TREC, 1995.

[23] Mónica Marrero et al. A comparison of the optimality of statistical significance tests for information retrieval evaluation. SIGIR, 2013.

[24] Douglas H. Johnson. The Insignificance of Statistical Significance Testing, 1999.

[25] Ben Carterette et al. Incorporating variability in user behavior into systems based evaluation. CIKM, 2012.

[26] Gordon V. Cormack et al. Statistical precision of information retrieval evaluation. SIGIR, 2006.

[27] James Allan et al. A comparison of statistical significance tests for information retrieval evaluation. CIKM '07, 2007.

[28] Peter Willett et al. Readings in Information Retrieval, 1997.

[29] J. Allan et al. Readings in Information Retrieval, 1998.

[30] J. Ioannidis. Contradicted and initially stronger effects in highly cited clinical research. JAMA, 2005.

[31] James Allan et al. Agreement among statistical significance tests for information retrieval evaluation at varying sample sizes. SIGIR, 2009.