Evaluation of retrieval effectiveness with incomplete relevance data: Theoretical and experimental comparison of three measures

This paper investigates two relatively new measures of retrieval effectiveness, Bpref and RankEff, in relation to the problem of incomplete relevance data. Both measures disregard documents that have not been judged for relevance, and they are compared theoretically and experimentally. The experimental comparisons also include a third measure, the well-known mean uninterpolated average precision. The results indicate that RankEff is the most stable of the three measures when the amount of relevance data is reduced, with respect to both system ranking and absolute values. In addition, RankEff has the lowest error rate.
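
To make concrete what it means for a measure to leave unjudged documents out of account, the following is a minimal sketch, not the authors' implementation. It assumes the original bpref definition of Buckley and Voorhees (SIGIR 2004) and the common convention that average precision treats unjudged documents as nonrelevant. RankEff is left out because the abstract does not give its formula. The function names and the data layout (a ranked list of document ids plus a dictionary of judgments) are illustrative assumptions.

```python
def bpref(ranking, judgments):
    """Bpref for one topic (original Buckley-Voorhees form, assumed here).

    ranking   -- list of document ids, best first
    judgments -- dict mapping doc id -> True (relevant) / False (nonrelevant);
                 unjudged documents are simply absent from the dict
    """
    num_relevant = sum(1 for rel in judgments.values() if rel)
    if num_relevant == 0:
        return 0.0
    nonrel_seen = 0  # judged nonrelevant documents ranked so far
    total = 0.0
    for doc in ranking:
        if doc not in judgments:
            continue  # unjudged documents are ignored entirely
        if judgments[doc]:
            # Only the first R judged nonrelevant documents count against
            # a relevant document, hence the cap at num_relevant.
            total += 1.0 - min(nonrel_seen, num_relevant) / num_relevant
        else:
            nonrel_seen += 1
    return total / num_relevant


def average_precision(ranking, judgments):
    """Uninterpolated average precision for one topic.

    Assumption: unjudged documents count as nonrelevant, so they still
    occupy a rank position and lower the precision at later ranks.
    """
    num_relevant = sum(1 for rel in judgments.values() if rel)
    if num_relevant == 0:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if judgments.get(doc, False):
            hits += 1
            total += hits / rank
    return total / num_relevant


# Hypothetical toy topic: five retrieved documents, all judged.
ranking = ["d1", "d2", "d3", "d4", "d5"]
full = {"d1": True, "d2": False, "d3": True, "d4": False, "d5": False}
# The same topic with the judgment for d2 withdrawn.
reduced = {doc: rel for doc, rel in full.items() if doc != "d2"}

print(bpref(ranking, full), bpref(ranking, reduced))
print(average_precision(ranking, full), average_precision(ranking, reduced))
```

In this toy case, withdrawing the judgment for d2 removes that document from the bpref computation altogether, whereas average precision still counts it as a nonrelevant document at rank 2. The example shows only the mechanics of handling unjudged documents and says nothing about the stability results reported in the paper.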