Stochastic Simulation of Test Collections: Evaluation Scores

Part of Information Retrieval evaluation research is limited by the fact that we do not know the distributions of system effectiveness over the populations of topics and, by extension, their true mean scores. The usual workaround consists of resampling topics from an existing collection and approximating the statistics of interest with the observations made between random subsamples, as if one represented the population and the other a random sample from it. However, this methodology is clearly limited by the availability of data, the impossibility of controlling the properties of these data, and the fact that we do not really measure what we intend to measure. To overcome these limitations, we propose a method based on vine copulas for stochastic simulation of evaluation results where the true system distributions are known upfront. In the basic use case, it takes the scores from an existing collection to build a semi-parametric model representing the set of systems and the population of topics, which can then be used to make realistic simulations of the scores by the same systems but on random new topics. Our ability to simulate this kind of data not only eliminates the current limitations, but also offers new opportunities for research. As an example, we show the benefits of this approach in two sample applications replicating typical experiments found in the literature. We provide a full R package to simulate new data following the proposed method, which can also be used to fully reproduce the results in this paper.
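To make the idea of simulating scores "by the same systems but on random new topics" concrete, the following is a minimal Python sketch, not the method or the R package described above. It makes several simplifying assumptions: only two systems are modeled, a single bivariate Gaussian pair copula stands in for the vine copula, the marginals are purely empirical rather than semi-parametric, and the per-topic scores are invented toy values. All function names (`normal_scores`, `fit`, `simulate`) are hypothetical.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for the Gaussian copula


def normal_scores(values):
    """Turn observations into normal scores through their ranks."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    z = [0.0] * n
    for rank, i in enumerate(order, start=1):
        z[i] = STD_NORMAL.inv_cdf(rank / (n + 1))
    return z


def pearson(x, y):
    """Plain Pearson correlation between two equally long lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def empirical_quantile(sorted_values, u):
    """Inverse empirical CDF: map u in [0, 1] to an observed score."""
    idx = min(int(u * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[idx]


def fit(scores_a, scores_b):
    """Fit the toy model: one copula parameter plus two empirical marginals."""
    rho = pearson(normal_scores(scores_a), normal_scores(scores_b))
    return {"rho": rho, "a": sorted(scores_a), "b": sorted(scores_b)}


def simulate(model, n_topics, rng):
    """Draw paired scores of both systems on n_topics simulated new topics."""
    rho = model["rho"]
    residual = max(0.0, 1.0 - rho ** 2) ** 0.5
    simulated = []
    for _ in range(n_topics):
        g1, g2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        z1, z2 = g1, rho * g1 + residual * g2        # correlated normals
        u1, u2 = STD_NORMAL.cdf(z1), STD_NORMAL.cdf(z2)  # copula sample
        simulated.append((empirical_quantile(model["a"], u1),
                          empirical_quantile(model["b"], u2)))
    return simulated


# Toy per-topic scores for two systems (assumed values, for illustration only).
scores_a = [0.12, 0.35, 0.41, 0.08, 0.56, 0.62, 0.27, 0.49, 0.33, 0.71]
scores_b = [0.10, 0.30, 0.45, 0.05, 0.50, 0.66, 0.21, 0.40, 0.38, 0.64]

model = fit(scores_a, scores_b)
new_collection = simulate(model, n_topics=50, rng=random.Random(42))

# Under this toy model the true mean of each system is known upfront:
# it is the mean of its empirical marginal.
true_mean_a = sum(scores_a) / len(scores_a)
observed_mean_a = sum(a for a, _ in new_collection) / len(new_collection)
print(true_mean_a, observed_mean_a)
```

Because the marginals here are empirical, every simulated score is one of the observed scores. A smoothed or parametric marginal would be needed to generate unseen values, and a vine copula would be needed to model more than two systems with flexible pairwise dependence.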
