Retrieval evaluation with incomplete information

This paper examines whether the Cranfield evaluation methodology is robust to gross violations of the completeness assumption (i.e., the assumption that all relevant documents within a test collection have been identified and are present in the collection). We show that current evaluation measures are not robust to substantially incomplete relevance judgments. A new measure is introduced that is both highly correlated with existing measures when complete judgments are available and more robust to incomplete judgment sets. This finding suggests that substantially larger or dynamic test collections built using current pooling practices should be viable laboratory tools, even though their relevance information will be incomplete and imperfect.
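The abstract does not spell out the new measure. The measure introduced in this paper is widely known as bpref (binary preference), which scores a run using judged documents only, so unjudged documents neither help nor hurt a system. As an illustrative sketch only: the Python below follows the trec_eval-style formulation with a min(R, N) normalizer (the paper's original definition instead restricts attention to the first R judged nonrelevant documents); the function name and data layout are assumptions made here for illustration.

```python
from typing import Iterable, Set

def bpref(ranking: Iterable[str], relevant: Set[str], nonrelevant: Set[str]) -> float:
    """Score a ranked list against an incomplete judgment set.

    For each retrieved relevant document, count how many judged nonrelevant
    documents are ranked above it; documents with no judgment are ignored,
    which is what makes the measure tolerant of incomplete judgments.
    """
    R, N = len(relevant), len(nonrelevant)
    if R == 0:
        return 0.0
    denom = min(R, N)        # normalizer in the trec_eval-style variant
    nonrel_above = 0         # judged nonrelevant documents seen so far
    score = 0.0
    for doc in ranking:
        if doc in nonrelevant:
            nonrel_above += 1
        elif doc in relevant:
            if denom == 0:   # no judged nonrelevant documents at all
                score += 1.0
            else:
                score += 1.0 - min(nonrel_above, denom) / denom
    return score / R

if __name__ == "__main__":
    # Toy example: d7 is unjudged and is simply skipped.
    ranking = ["d3", "d7", "d1", "d9", "d2"]   # system output, best first
    relevant = {"d1", "d2", "d5"}              # judged relevant
    nonrelevant = {"d3", "d9"}                 # judged nonrelevant
    print(round(bpref(ranking, relevant, nonrelevant), 3))  # 0.167
```

In the toy run, d1 is penalized for the one judged nonrelevant document above it and d2 for two, while the unretrieved relevant document d5 contributes nothing; this preference-over-judged-pairs design is what keeps the score stable as judgments are removed.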
