Estimating average precision with incomplete and imperfect judgments

We consider the problem of evaluating retrieval systems using incomplete judgment information. Buckley and Voorhees recently demonstrated that retrieval systems can be efficiently and effectively evaluated using incomplete judgments via the bpref measure [6]. When relevance judgments are complete, bpref approximates the value of average precision computed with those complete judgments. When relevance judgments are incomplete, however, bpref deviates from this value, though it continues to rank systems in a manner similar to average precision evaluated over a complete judgment set. In this work, we propose three evaluation measures that (1) approximate average precision even when the relevance judgments are incomplete and (2) are more robust to incomplete or imperfect relevance judgments than bpref. The proposed estimates of average precision are simple and accurate, and we demonstrate their utility using TREC data.
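
As a point of reference, the following is a minimal Python sketch of the two measures the abstract contrasts: standard average precision, which presumes a complete set of relevance judgments, and bpref, which uses only judged documents. The document IDs, function names, and the particular bpref formulation (a commonly used variant of the Buckley-Voorhees measure) are illustrative assumptions; the three estimators of average precision proposed in this work are not reproduced here.

def average_precision(ranking, relevant):
    # Standard average precision for one query: the mean of precision@k over
    # the ranks k at which relevant documents appear, assuming 'relevant' is
    # the complete set of relevant documents.
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / k  # precision at rank k
    return total / len(relevant)

def bpref(ranking, relevant, nonrelevant):
    # bpref for one query: unjudged documents are ignored, and each retrieved
    # relevant document is penalized by the fraction of judged nonrelevant
    # documents ranked above it (one common formulation of the measure).
    R, N = len(relevant), len(nonrelevant)
    if R == 0:
        return 0.0
    denom = min(R, N)
    nonrel_seen, score = 0, 0.0
    for doc in ranking:
        if doc in nonrelevant:
            nonrel_seen += 1
        elif doc in relevant:
            score += 1.0 if denom == 0 else 1.0 - min(nonrel_seen, denom) / denom
    return score / R

# Toy run with illustrative judgments: d3 and d5 are unjudged.
ranking = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d4"}      # judged relevant
nonrelevant = {"d2"}         # judged nonrelevant
print(average_precision(ranking, relevant))   # 0.75
print(bpref(ranking, relevant, nonrelevant))  # 0.5

In the toy run, bpref simply ignores the unjudged documents, which is the behavior that motivates comparing it against direct estimates of average precision under incomplete judgments.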

[1] Donna K. Harman, et al. Overview of the Second Text REtrieval Conference (TREC-2), 1994, HLT.

[2] Donna K. Harman, et al. Overview of the Third Text REtrieval Conference (TREC-3), 1996.

[3] Cyril Cleverdon, et al. The Cranfield tests on index language devices, 1997.

[4] Justin Zobel, et al. How reliable are the results of large-scale information retrieval experiments?, 1998, SIGIR '98.

[5] Charles L. A. Clarke, et al. Efficient construction of large test collections, 1998, SIGIR '98.

[6] Stanley F. Chen, et al. An Empirical Study of Smoothing Techniques for Language Modeling, 1996, ACL.

[7] Ellen M. Voorhees, et al. Evaluation by highly relevant documents, 2001, SIGIR '01.

[8] Ian Soboroff, et al. Ranking retrieval systems without relevance judgments, 2001, SIGIR '01.

[9] Susan T. Dumais, et al. Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, 2004, SIGIR 2004.

[10] Javed A. Aslam, et al. A unified model for metasearch, pooling, and system evaluation, 2003, CIKM '03.

[11] Rabia Nuray-Turan, et al. Automatic ranking of retrieval systems in imperfect environments, 2003, SIGIR '03.

[12] Javed A. Aslam, et al. On the effectiveness of evaluating retrieval systems in the absence of relevance judgments, 2003, SIGIR '03.

[13] Ellen M. Voorhees, et al. Retrieval evaluation with incomplete information, 2004, SIGIR '04.

[14] Stephen E. Robertson, et al. On Collection Size and Retrieval Effectiveness, 2004, Information Retrieval.

[15] Charles L. A. Clarke, et al. The TREC 2005 Terabyte Track, 2005, TREC.

[16] Emine Yilmaz, et al. The maximum entropy method for analyzing retrieval measures, 2005, SIGIR '05.

[17] Charles L. A. Clarke, et al. The TREC 2006 Terabyte Track, 2006, TREC.

[18] HARD Track Overview in TREC 2004: High Accuracy Retrieval from Documents, TREC.