Paper information - Can Deep Effectiveness Metrics Be Evaluated Using Shallow Judgment Pools?

Can Deep Effectiveness Metrics Be Evaluated Using Shallow Judgment Pools?

Increasing test collection sizes and limited judgment budgets create measurement challenges for IR batch evaluations, challenges that are greater when using deep effectiveness metrics than when using shallow metrics, because of the increased likelihood that unjudged documents will be encountered. Here we study the problem of metric score adjustment, with the goal of accurately estimating system performance when using deep metrics and limited judgment sets, assuming that dynamic score adjustment is required per topic due to the variability in the number of relevant documents. We seek to induce system orderings that are as close as possible to the orderings that would arise if full judgments were available. Starting with depth-based pooling, and no prior knowledge of sampling probabilities, the first phase of our two-stage process computes a background gain for each document based on rank-level statistics. The second stage then accounts for the distributional variance of relevant documents. We also exploit the frequency statistics of pooled relevant documents in order to determine a threshold for dynamically determining the set of topics to be adjusted. Taken together, our results show that: (i) better score estimates can be achieved when compared to previous work; (ii) by setting a global threshold, we are able to adapt our methods to different collections; and (iii) the proposed estimation methods reliably approximate the system orderings achieved when many more relevance judgments are available. We also consider pools generated by a two-strata sampling approach.
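The idea of a rank-level background gain can be illustrated with a small sketch. This is not the authors' estimator: it assumes, purely for illustration, that the background gain at a rank is the fraction of judged documents at that rank (across all pooled runs for one topic) that are relevant, and that the deep metric is rank-biased precision (RBP) with persistence `p`. The function names, the toy runs, and the choice of RBP are all hypothetical.

```python
from typing import Dict, List


def rank_level_background_gain(
    runs: List[List[str]], qrels: Dict[str, float], depth: int
) -> List[float]:
    """Illustrative assumption: background gain at rank r is the mean gain
    of the judged documents that the runs placed at rank r."""
    gains = []
    for r in range(depth):
        judged = 0
        total = 0.0
        for run in runs:
            if r < len(run) and run[r] in qrels:
                judged += 1
                total += qrels[run[r]]
        # With no judged document at this rank, fall back to zero gain.
        gains.append(total / judged if judged else 0.0)
    return gains


def adjusted_rbp(
    run: List[str], qrels: Dict[str, float], background: List[float], p: float
) -> float:
    """RBP in which each unjudged document receives the background gain
    for its rank instead of being treated as non-relevant."""
    score = 0.0
    for r, doc in enumerate(run):
        if doc in qrels:
            gain = qrels[doc]
        else:
            gain = background[r] if r < len(background) else 0.0
        score += (1.0 - p) * (p ** r) * gain
    return score


# Toy topic: three runs pooled to depth 1, so only d1, d2, d4 are judged.
run_a = ["d1", "d2", "d4"]
run_b = ["d2", "d4", "d1"]
run_c = ["d4", "d5", "d6"]  # d5 and d6 are unjudged
toy_qrels = {"d1": 1.0, "d2": 1.0, "d4": 0.0}

background = rank_level_background_gain([run_a, run_b, run_c], toy_qrels, depth=3)
unadjusted = adjusted_rbp(run_c, toy_qrels, [], p=0.5)
adjusted = adjusted_rbp(run_c, toy_qrels, background, p=0.5)
```

In this toy topic, treating unjudged documents as non-relevant gives `run_c` a score of zero, while the imputed gains give it a small positive score. The second stage and the topic-selection threshold described in the abstract are not modelled here.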

J. Shane Culpepper | Alistair Moffat | Xiaolu Lu

[1] Thorsten Joachims, et al. Unbiased Comparative Evaluation of Ranking Functions, 2016, ICTIR.

[2] Tetsuya Sakai, et al. Alternatives to Bpref, 2007, SIGIR.

[3] Allan Hanbury, et al. The Solitude of Relevant Documents in the Pool, 2016, CIKM.

[4] Emine Yilmaz, et al. Estimating average precision when judgments are incomplete, 2007, Knowledge and Information Systems.

[5] Ellen M. Voorhees, et al. Bias and the limits of pooling for large collections, 2007, Information Retrieval.

[6] Justin Zobel, et al. How reliable are the results of large-scale information retrieval experiments?, 1998, SIGIR '98.

[7] Emine Yilmaz, et al. A statistical method for system evaluation using incomplete judgments, 2006, SIGIR.

[8] J. Shane Culpepper, et al. Improving test collection pools with machine learning, 2014, ADCS.

[9] Tetsuya Sakai. Comparing metrics across TREC and NTCIR: the robustness to system bias, 2008, CIKM '08.

[10] Alistair Moffat, et al. Users versus models: what observation tells us about effectiveness metrics, 2013, CIKM.

[11] Olivier Chapelle, et al. Expected reciprocal rank for graded relevance, 2009, CIKM.