Meta-Analysis for Retrieval Experiments Involving Multiple Test Collections

Traditional practice recommends that information retrieval experiments be run over multiple test collections, to provide evidence, if not proof, that gains in performance are likely to generalize to other collections and tasks. However, because of the pooling assumptions under which collections are constructed, evaluation scores are not directly comparable across different test collections. We present a widely-used statistical tool, meta-analysis, as a framework for reporting results from IR experiments that use multiple test collections. We demonstrate the meta-analytic approach through two standard experiments, on stemming and pseudo-relevance feedback, and compare the results to those obtained from score standardization. Meta-analysis unifies several recent recommendations from the literature, including score standardization, reporting effect sizes rather than raw score differences, and avoiding reliance on null-hypothesis statistical testing. It therefore represents an important methodological improvement over applying these techniques in isolation.
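To make the abstract's proposal concrete, the sketch below shows one common way such an analysis can be set up: compute a standardized effect size for a paired system comparison on each test collection, then pool the per-collection effects with a random-effects (DerSimonian-Laird) model. This is a minimal illustration, not the paper's own implementation; the per-topic score arrays, the choice of Cohen's d for paired data, and its variance approximation are all assumptions made here for the example.

import numpy as np


def paired_effect_size(baseline, treatment):
    """Standardized mean difference (Cohen's d) of per-topic score deltas."""
    diff = np.asarray(treatment) - np.asarray(baseline)
    n = len(diff)
    d = diff.mean() / diff.std(ddof=1)
    var = 1.0 / n + d ** 2 / (2.0 * n)  # approximate sampling variance of d
    return d, var


def random_effects_pool(effects, variances):
    """DerSimonian-Laird pooled effect and its standard error."""
    effects, variances = np.asarray(effects), np.asarray(variances)
    w = 1.0 / variances                          # fixed-effect weights
    fixed = np.sum(w * effects) / np.sum(w)
    q = np.sum(w * (effects - fixed) ** 2)       # heterogeneity statistic Q
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - (len(effects) - 1)) / c)  # between-collection variance
    w_star = 1.0 / (variances + tau2)            # random-effects weights
    pooled = np.sum(w_star * effects) / np.sum(w_star)
    se = np.sqrt(1.0 / np.sum(w_star))
    return pooled, se


# Illustrative (synthetic) per-topic scores for two hypothetical collections,
# e.g. AP or nDCG for a baseline run and a treatment run on each topic set.
rng = np.random.default_rng(0)
runs_by_collection = [
    (rng.uniform(0.10, 0.50, 50), rng.uniform(0.12, 0.52, 50)),
    (rng.uniform(0.10, 0.60, 25), rng.uniform(0.11, 0.61, 25)),
]

per_collection = [paired_effect_size(b, t) for b, t in runs_by_collection]
pooled, se = random_effects_pool(*zip(*per_collection))
print(f"pooled effect = {pooled:.3f}, 95% CI half-width = {1.96 * se:.3f}")

Reporting the pooled effect with its confidence interval, rather than per-collection score deltas and p-values, is the kind of summary the meta-analytic framework described above is intended to produce.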
