Variations in relevance judgments and the measurement of retrieval effectiveness

Test collections have traditionally been used by information retrieval researchers to improve their retrieval strategies. To be viable as a laboratory tool, a collection must reliably rank different retrieval variants according to their true effectiveness. In particular, the relative effectiveness of two retrieval strategies should be insensitive to modest changes in the relevant document set, since individual relevance assessments are known to vary widely. The test collections developed in the TREC workshops have become the collections of choice in the retrieval research community. To verify their reliability, NIST investigated the effect that changes in the relevance assessments have on the evaluation of retrieval results. Very high correlations were found among the rankings of systems produced using different relevance judgment sets. The high correlations indicate that the comparative evaluation of retrieval performance is stable despite substantial differences in relevance judgments, and thus reaffirm the use of the TREC collections as laboratory tools.
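The stability result rests on comparing the system orderings induced by different judgment sets: score every system under each set of relevance judgments, rank the systems by those scores, and correlate the rankings. A minimal sketch of that comparison, assuming mean average precision (MAP) as the effectiveness measure and Kendall's tau as the rank correlation; the system names and scores below are purely illustrative:

```python
from scipy.stats import kendalltau

# Hypothetical per-system MAP scores computed under two different
# sets of relevance judgments (qrels) for the same topics.
map_qrels_a = {"sysA": 0.31, "sysB": 0.27, "sysC": 0.22, "sysD": 0.18}
map_qrels_b = {"sysA": 0.29, "sysB": 0.28, "sysC": 0.20, "sysD": 0.17}

# Align the scores by system so both lists describe the same runs.
systems = sorted(map_qrels_a)
scores_a = [map_qrels_a[s] for s in systems]
scores_b = [map_qrels_b[s] for s in systems]

# Kendall's tau measures agreement between the two induced rankings:
# +1.0 means identical orderings, -1.0 means exactly reversed.
tau, p_value = kendalltau(scores_a, scores_b)
print(f"Kendall's tau between system rankings: {tau:.3f}")
```

A tau near 1.0 across alternative judgment sets is what supports the conclusion that comparative evaluation remains stable even when the absolute effectiveness scores shift.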
