Paper information - A simple and efficient sampling method for estimating AP and NDCG

We consider the problem of large-scale retrieval evaluation. Two methods based on random sampling were recently proposed to reduce the extensive effort required to judge tens of thousands of documents. The first, proposed by Aslam et al. [1], is quite accurate and efficient but overly complex, which makes it difficult for the community to use. The second, infAP, proposed by Yilmaz et al. [14], is relatively simple but less efficient than the former because it employs uniform random sampling from the set of complete judgments. Further, neither method provides confidence intervals on the estimated values. The contribution of this paper is threefold: (1) we derive confidence intervals for infAP; (2) we extend infAP to incorporate nonrandom relevance judgments by employing stratified random sampling, thereby combining the efficiency of stratification with the simplicity of random sampling; and (3) we describe how this approach can be used to estimate nDCG from incomplete judgments. We validate the proposed methods using TREC data and demonstrate that they can incorporate nonrandom samples, such as those available in the TREC Terabyte track '06.
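The idea of estimating a measure from stratified samples can be illustrated with a minimal Python sketch. It is not the estimator derived in the paper. It assumes a generic inverse-probability (Horvitz-Thompson style) weighting, in which each judged document counts for one over its inclusion probability. The function names `stratified_sample`, `estimate_precision_at_k`, and `estimate_dcg`, the stratum names, and the sampling rates are all hypothetical choices made for illustration.

```python
import math
import random
from typing import Dict, List


def stratified_sample(strata: Dict[str, List[str]],
                      rates: Dict[str, float],
                      rng: random.Random) -> Dict[str, float]:
    """Draw a simple random sample without replacement inside each stratum.

    Returns a map from each sampled document to its inclusion probability.
    """
    inclusion: Dict[str, float] = {}
    for name, docs in strata.items():
        if not docs:
            continue  # nothing to sample from an empty stratum
        # Sample at least one document, and never more than the stratum holds.
        n = min(len(docs), max(1, round(rates[name] * len(docs))))
        for doc in rng.sample(docs, n):
            inclusion[doc] = n / len(docs)
    return inclusion


def estimate_precision_at_k(ranking: List[str], k: int,
                            inclusion: Dict[str, float],
                            relevance: Dict[str, int]) -> float:
    """Inverse-probability-weighted estimate of precision at depth k."""
    total = 0.0
    for doc in ranking[:k]:
        # Only sampled (judged) documents contribute to the estimate.
        if doc in inclusion and relevance.get(doc, 0) > 0:
            total += 1.0 / inclusion[doc]
    return total / k


def estimate_dcg(ranking: List[str], k: int,
                 inclusion: Dict[str, float],
                 relevance: Dict[str, int]) -> float:
    """Inverse-probability-weighted estimate of DCG at depth k.

    Assumes gain equals the relevance grade and a log2(rank + 1) discount.
    """
    total = 0.0
    for rank, doc in enumerate(ranking[:k], start=1):
        if doc in inclusion:
            gain = relevance.get(doc, 0)
            total += gain / (inclusion[doc] * math.log2(rank + 1))
    return total


if __name__ == "__main__":
    rng = random.Random(0)
    ranking = [f"d{i}" for i in range(1, 21)]
    # Hypothetical strata: judge the top of the ranking more densely.
    strata = {"top": ranking[:5], "rest": ranking[5:]}
    rates = {"top": 1.0, "rest": 0.2}
    inclusion = stratified_sample(strata, rates, rng)
    # Made-up judgments, available only for the sampled documents.
    relevance = {doc: int(i % 3 == 0)
                 for i, doc in enumerate(ranking) if doc in inclusion}
    print(estimate_precision_at_k(ranking, 20, inclusion, relevance))
    print(estimate_dcg(ranking, 20, inclusion, relevance))
```

Sampling the top stratum at a higher rate puts more of the judging effort where documents most affect the measure, while the inverse-probability weights keep the estimate unbiased under the sketch's assumptions. To normalize the estimated DCG into nDCG, the ideal DCG would also have to be estimated from the same sample, which this sketch does not attempt.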

[1] James Allan, et al. Minimal test collections for retrieval evaluation, 2006, SIGIR.

[2] Stanley F. Chen, et al. An Empirical Study of Smoothing Techniques for Language Modeling, 1996, ACL.

[3] Ian Soboroff, et al. A comparison of pooled and sampled relevance judgments, 2007, EVIA@NTCIR.

[4] Jaana Kekäläinen, et al. Cumulated gain-based evaluation of IR techniques, 2002, TOIS.

[5] Rajesh Shenoy, et al. On the robustness of relevance measures with incomplete judgments, 2007, SIGIR.

[6] D. K. Harmon, et al. Overview of the Third Text Retrieval Conference (TREC-3), 1996.

[7] Emine Yilmaz, et al. A statistical method for system evaluation using incomplete judgments, 2006, SIGIR.

[8] Tetsuya Sakai, et al. Alternatives to Bpref, 2007, SIGIR.

[9] Alistair Moffat, et al. Strategic system comparisons via targeted relevance judgments, 2007, SIGIR.

[10] B. E. Eckbo, et al. Appendix, 1826, Epilepsy Research.

[11] Paul Over, et al. The TREC VIdeo Retrieval Evaluation (TRECVID): A Case Study and Status Report, 2004, RIAO.

[12] Cyril Cleverdon, et al. The Cranfield tests on index language devices, 1997.

[13] Charles L. A. Clarke, et al. The TREC 2006 Terabyte Track, 2006, TREC.

[14] Ellen M. Voorhees, et al. Retrieval evaluation with incomplete information, 2004, SIGIR '04.

[15] Emine Yilmaz, et al. Estimating average precision with incomplete and imperfect judgments, 2006, CIKM '06.