Statistical Significance, Power, and Sample Sizes: A Systematic Review of SIGIR and TOIS, 2006-2015

We conducted a systematic review of 840 SIGIR full papers and 215 TOIS papers published between 2006 and 2015. The original objective of the study was to identify IR effectiveness experiments that are seriously underpowered (i.e., the sample size is so small that the probability of missing a real difference is extremely high) or overpowered (i.e., the sample size is so large that even an extremely small effect is declared statistically significant). However, it quickly became clear to us that many IR effectiveness papers either lack significance testing altogether or fail to report p-values and/or test statistics, which prevents power analysis. Hence we first report on how IR researchers report (or fail to report) significance test results, what types of tests they use, and how reporting practices have changed over the last decade. From the papers that reported enough information for power analysis, we identify extremely overpowered and underpowered experiments, as well as appropriate sample sizes for future experiments. The raw results of our systematic survey of 1,055 papers and our R scripts for power analysis are available online. Our hope is that this study will help improve the reporting practices and experimental designs of future IR effectiveness studies.
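
To illustrate the kind of power analysis involved, here is a minimal sketch using base R's power.t.test (not the authors' released scripts); the effect size d = 0.5, the target power of 0.80, and the topic count of 50 below are illustrative assumptions, not values taken from the paper:

    # Required number of topics for a paired t-test at alpha = 0.05 (two-sided),
    # assuming a standardized effect size of d = 0.5 and a target power of 0.80.
    power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80,
                 type = "paired", alternative = "two.sided")

    # Conversely, the power actually achieved by a hypothetical experiment with
    # 50 topics when the true effect size is small (d = 0.2).
    power.t.test(n = 50, delta = 0.2, sd = 1, sig.level = 0.05,
                 type = "paired", alternative = "two.sided")

The first call solves for the sample size (number of topics) needed to reach the stated power; the second treats the sample size as fixed and reports the power achieved, which is how under- and overpowered experiments can be flagged.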
