Using Replicates in Information Retrieval Evaluation

This article explores a method for more accurately estimating the main effect of the system in a typical test-collection-based evaluation of information retrieval systems, thereby increasing the sensitivity of system comparisons. Randomly partitioning the test document collection allows a given system to be tested multiple times on each topic, producing replicates. Bootstrap ANOVA can use these replicates to extract system-topic interactions, something not possible without replicates, yielding a more precise estimate of the system effect and a narrower confidence interval around it. Experiments using multiple TREC collections demonstrate that removing the system-topic interactions substantially narrows the confidence intervals around the system effect and increases the number of significant pairwise differences found. Further, the method is robust to small changes in the number of partitions used, to variability in the documents that constitute the partitions, and to the choice of effectiveness measure used to quantify system performance.
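To make the idea concrete, below is a minimal sketch, not the authors' code, of the kind of analysis the abstract describes: once documents have been randomly partitioned, each (system, topic) pair yields several effectiveness scores (replicates), which support a two-way decomposition into system effects, topic effects, and system-topic interactions, plus a bootstrap confidence interval for each system's main effect. The score array here is synthetic, and the partition counts, the placeholder AP-like scores, and the within-cell resampling scheme are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sys, n_topic, n_rep = 4, 50, 5                         # systems, topics, partitions (replicates)
scores = rng.beta(2, 5, size=(n_sys, n_topic, n_rep))    # placeholder AP-like scores; real scores
                                                         # would come from scoring each system on
                                                         # each topic within each random partition

def anova_decompose(y):
    """Two-way model with replicates: y_stk = mu + a_s + b_t + (ab)_st + e_stk."""
    mu = y.mean()
    a = y.mean(axis=(1, 2)) - mu                         # system main effects
    b = y.mean(axis=(0, 2)) - mu                         # topic main effects
    cell = y.mean(axis=2)                                # per (system, topic) cell means
    ab = cell - mu - a[:, None] - b[None, :]             # system-topic interaction (needs replicates
                                                         # to be separated from residual error)
    return mu, a, b, ab

def bootstrap_system_ci(y, n_boot=2000, alpha=0.05):
    """Percentile bootstrap CI for each system's main effect, resampling replicates within cells."""
    boots = np.empty((n_boot, y.shape[0]))
    for i in range(n_boot):
        idx = rng.integers(0, y.shape[2], size=y.shape)  # resample partition scores with replacement
        boots[i] = anova_decompose(np.take_along_axis(y, idx, axis=2))[1]
    return np.percentile(boots, [100 * alpha / 2, 100 * (1 - alpha / 2)], axis=0)

mu, a, b, ab = anova_decompose(scores)
lo, hi = bootstrap_system_ci(scores)
for s in range(n_sys):
    print(f"system {s}: effect {a[s]:+.4f}  95% CI [{lo[s]:+.4f}, {hi[s]:+.4f}]")
```

The point of the sketch is the role of the replicate axis: without it, the interaction term and the residual error are confounded, so the interaction cannot be removed from the estimate of the system effect and its confidence interval stays correspondingly wide.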
