Can Deep Effectiveness Metrics Be Evaluated Using Shallow Judgment Pools?

Increasing test collection sizes and limited judgment budgets create measurement challenges for IR batch evaluations, challenges that are greater when using deep effectiveness metrics than when using shallow metrics, because of the increased likelihood that unjudged documents will be encountered. Here we study the problem of metric score adjustment, with the goal of accurately estimating system performance when using deep metrics and limited judgment sets, recognizing that score adjustment must be applied dynamically on a per-topic basis, since the number of relevant documents varies across topics. We seek to induce system orderings that are as close as possible to the orderings that would arise if full judgments were available. Starting with depth-based pooling, and no prior knowledge of sampling probabilities, the first stage of our two-stage process computes a background gain for each document based on rank-level statistics. The second stage then accounts for the distributional variance of relevant documents. We also exploit the frequency statistics of pooled relevant documents in order to set a threshold that dynamically determines the set of topics to be adjusted. Taken together, our results show that: (i) better score estimates can be achieved than with previous methods; (ii) by setting a global threshold, our methods can be adapted to different collections; and (iii) the proposed estimation methods reliably approximate the system orderings achieved when many more relevance judgments are available. We also consider pools generated by a two-strata sampling approach.
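The abstract's first stage, computing a background gain from rank-level statistics, can be illustrated with a minimal sketch. The idea, under our assumptions, is to estimate the probability that a document at a given rank is relevant, using only the judged (pooled) documents across topics and runs, and then to substitute that estimate as the gain of any unjudged document when scoring a run with a deep metric such as rank-biased precision (RBP). All function and variable names here are illustrative, not taken from the paper.

```python
def rank_level_background_gain(runs, qrels, max_depth):
    """Estimate P(relevant | rank) from judged documents only.

    runs  : list of (topic, ranked list of doc ids)
    qrels : dict mapping (topic, doc) -> relevance label, for judged docs
    Returns a per-rank background gain; ranks with no judged documents
    fall back to 0.0.
    """
    rel_counts = [0] * max_depth
    judged_counts = [0] * max_depth
    for topic, ranking in runs:
        for r, doc in enumerate(ranking[:max_depth]):
            label = qrels.get((topic, doc))
            if label is not None:  # only judged documents contribute
                judged_counts[r] += 1
                rel_counts[r] += 1 if label > 0 else 0
    return [rel_counts[r] / judged_counts[r] if judged_counts[r] else 0.0
            for r in range(max_depth)]

def adjusted_rbp(ranking, topic, qrels, bg, p=0.8):
    """RBP in which unjudged documents receive the background gain
    for their rank, rather than being treated as non-relevant."""
    score = 0.0
    for r, doc in enumerate(ranking[:len(bg)]):
        label = qrels.get((topic, doc))
        if label is not None:
            gain = 1.0 if label > 0 else 0.0
        else:
            gain = bg[r]  # imputed background gain for unjudged docs
        score += (1 - p) * (p ** r) * gain
    return score
```

This sketch omits the paper's second stage (adjusting for the distributional variance of relevant documents) and its per-topic thresholding; it shows only the general shape of rank-based gain imputation.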
