The Philosophy of Information Retrieval Evaluation

Evaluation conferences such as TREC, CLEF, and NTCIR are modern examples of the Cranfield evaluation paradigm. In the Cranfield paradigm, researchers perform experiments on test collections to compare the relative effectiveness of different retrieval approaches. Test collections allow researchers to control the effects of different system parameters, increasing the power and decreasing the cost of retrieval experiments compared with user-based evaluations. This paper reviews the fundamental assumptions and appropriate uses of the Cranfield paradigm, especially as they apply in the context of the evaluation conferences.
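
As a concrete illustration (not drawn from the paper itself), a Cranfield-style batch experiment scores each system's fixed ranked output against a set of relevance judgments and summarizes the result with a standard effectiveness measure. The minimal sketch below computes mean average precision over hypothetical topics, judgments, and runs; all identifiers and data are invented for demonstration.

    # Illustrative sketch of a Cranfield-style batch evaluation.
    # The qrels and run data below are hypothetical.

    from typing import Dict, List, Set


    def average_precision(ranking: List[str], relevant: Set[str]) -> float:
        """Average precision of one ranked list given the judged-relevant set."""
        if not relevant:
            return 0.0
        hits = 0
        precision_sum = 0.0
        for rank, doc_id in enumerate(ranking, start=1):
            if doc_id in relevant:
                hits += 1
                precision_sum += hits / rank  # precision at this relevant document
        return precision_sum / len(relevant)


    def mean_average_precision(run: Dict[str, List[str]],
                               qrels: Dict[str, Set[str]]) -> float:
        """Mean of per-topic average precision over all judged topics."""
        scores = [average_precision(run.get(topic, []), rel)
                  for topic, rel in qrels.items()]
        return sum(scores) / len(scores) if scores else 0.0


    if __name__ == "__main__":
        # Hypothetical relevance judgments (topic -> relevant document ids).
        qrels = {"t1": {"d2", "d5"}, "t2": {"d1"}}
        # Hypothetical ranked output of two systems over the same topics.
        system_a = {"t1": ["d2", "d3", "d5"], "t2": ["d4", "d1"]}
        system_b = {"t1": ["d3", "d2", "d4"], "t2": ["d1", "d4"]}
        # Scoring both systems on the same collection isolates the retrieval
        # approach as the variable under study.
        print("MAP system A:", round(mean_average_precision(system_a, qrels), 3))
        print("MAP system B:", round(mean_average_precision(system_b, qrels), 3))

Because the topics, documents, and judgments are held fixed, any difference in the two MAP scores can be attributed to the retrieval approaches being compared rather than to variation in users or information needs.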
