Meta-Analysis for Retrieval Experiments Involving Multiple Test Collections

Traditional practice recommends that information retrieval experiments be run over multiple test collections, to provide evidence, if not proof, that gains in performance are likely to generalize to other collections and tasks. However, because of the pooling assumptions under which collections are constructed, evaluation scores are not directly comparable across different test collections. We present a widely-used statistical tool, meta-analysis, as a framework for reporting results from IR experiments that use multiple test collections. We demonstrate the meta-analytic approach through two standard experiments, on stemming and pseudo-relevance feedback, and compare the results to those obtained from score standardization. Meta-analysis unifies several recent recommendations from the literature, including score standardization, reporting effect sizes rather than raw score differences, and avoiding reliance on null-hypothesis statistical testing. It therefore represents an important methodological improvement over applying these techniques in isolation.
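To make the abstract's proposal concrete, the sketch below shows one common way such an analysis can be set up: compute a standardized effect size for a paired system comparison on each test collection, then pool the per-collection effects with a random-effects (DerSimonian-Laird) model. This is a minimal illustration, not the paper's own implementation; the per-topic score arrays, the choice of Cohen's d for paired data, and its variance approximation are all assumptions made here for the example.

import numpy as np


def paired_effect_size(baseline, treatment):
    """Standardized mean difference (Cohen's d) of per-topic score deltas."""
    diff = np.asarray(treatment) - np.asarray(baseline)
    n = len(diff)
    d = diff.mean() / diff.std(ddof=1)
    var = 1.0 / n + d ** 2 / (2.0 * n)  # approximate sampling variance of d
    return d, var


def random_effects_pool(effects, variances):
    """DerSimonian-Laird pooled effect and its standard error."""
    effects, variances = np.asarray(effects), np.asarray(variances)
    w = 1.0 / variances                          # fixed-effect weights
    fixed = np.sum(w * effects) / np.sum(w)
    q = np.sum(w * (effects - fixed) ** 2)       # heterogeneity statistic Q
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - (len(effects) - 1)) / c)  # between-collection variance
    w_star = 1.0 / (variances + tau2)            # random-effects weights
    pooled = np.sum(w_star * effects) / np.sum(w_star)
    se = np.sqrt(1.0 / np.sum(w_star))
    return pooled, se


# Illustrative (synthetic) per-topic scores for two hypothetical collections,
# e.g. AP or nDCG for a baseline run and a treatment run on each topic set.
rng = np.random.default_rng(0)
runs_by_collection = [
    (rng.uniform(0.10, 0.50, 50), rng.uniform(0.12, 0.52, 50)),
    (rng.uniform(0.10, 0.60, 25), rng.uniform(0.11, 0.61, 25)),
]

per_collection = [paired_effect_size(b, t) for b, t in runs_by_collection]
pooled, se = random_effects_pool(*zip(*per_collection))
print(f"pooled effect = {pooled:.3f}, 95% CI half-width = {1.96 * se:.3f}")

Reporting the pooled effect with its confidence interval, rather than per-collection score deltas and p-values, is the kind of summary the meta-analytic framework described above is intended to produce.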
