Score adjustment for correction of pooling bias

Information retrieval systems are evaluated against test collections of topics, documents, and assessments of which documents are relevant to which topics. Documents are chosen for relevance assessment by pooling runs from a set of existing systems. New systems can return unassessed documents, leading to an evaluation bias against them. In this paper, we propose to estimate the degree of bias against an unpooled system and to adjust the system's score accordingly. Bias estimation can be done via leave-one-out experiments on the existing, pooled systems, but this requires the problematic assumption that the new system is similar to the existing ones. Instead, we propose that all systems, new and pooled, be fully assessed against a common set of topics, and that the bias observed against the new system on the common topics be used to adjust its scores on the existing topics. We demonstrate, using resampling experiments on TREC test sets, that our method leads to a marked reduction in error, even with only a relatively small number of common topics, and that the error decreases as the number of topics increases.
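
To make the proposal concrete, the sketch below shows one simple way such an adjustment could be realised: measure the new system's per-topic score loss under pooled versus full judgments on the common topics, average it, and add that offset back onto its pooled scores on the existing topics. The additive form, the use of a simple mean, and the topic identifiers and score values are illustrative assumptions made for this sketch; the abstract does not fix the exact form of the adjustment.

```python
# Minimal sketch of an additive pooling-bias adjustment (an assumed form;
# the paper's abstract does not specify the exact adjustment function).

from statistics import mean


def estimate_bias(full_scores, pooled_scores):
    """Mean per-topic score loss the new system suffers when evaluated
    against pooled rather than full judgments, measured on the common,
    fully assessed topics."""
    assert full_scores.keys() == pooled_scores.keys()
    return mean(full_scores[t] - pooled_scores[t] for t in full_scores)


def adjust_scores(pooled_scores_existing, bias):
    """Add the estimated bias back onto the new system's pool-judged
    scores for the existing topics."""
    return {t: s + bias for t, s in pooled_scores_existing.items()}


if __name__ == "__main__":
    # Hypothetical per-topic scores (e.g. average precision) for the new system.
    full_common = {"c1": 0.42, "c2": 0.35, "c3": 0.50}     # common topics, full judgments
    pooled_common = {"c1": 0.38, "c2": 0.30, "c3": 0.47}   # common topics, pooled judgments
    pooled_existing = {"t301": 0.25, "t302": 0.31}         # existing topics, pooled judgments

    bias = estimate_bias(full_common, pooled_common)
    print(adjust_scores(pooled_existing, bias))
```

Under these assumptions, the adjusted scores simply shift each pool-judged topic score upward by the average shortfall observed on the common topics.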
