How reliable are the results of large-scale information retrieval experiments?

Two stages in measurement of techniques for information retrieval are gathering of documents for relevance assessment and use of the assessments to numerically evaluate effectiveness. We consider both of these stages in the context of the TREC experiments, to determine whether they lead to measurements that are trustworthy and fair. Our detailed empirical investigation of the TREC results shows that the measured relative performance of systems appears to be reliable, but that recall is overestimated: it is likely that many relevant documents have not been found. We propose a new pooling strategy that can significantly increase the number of relevant documents found for given effort, without compromising fairness.
