Variations in relevance assessments and the measurement of retrieval effectiveness

The purpose of this article is to draw attention to the problem of variations in relevance assessments and the effects these may have on measures of retrieval effectiveness. Through an analytical review of the literature, I show that despite known wide variations in relevance assessments in experimental test collections, their effects on the measurement of retrieval performance remain almost completely unstudied. I further argue that what we know about the many variables found to affect relevance assessments under experimental conditions, together with our new understanding of psychological, situational, user-based relevance, points to a single conclusion: we can no longer rest the evaluation of information retrieval systems on the assumption that such variations do not significantly affect the measurement of retrieval performance. What is needed is a series of thorough, rigorous, and extensive tests of precisely how, and under what conditions, variations in relevance assessments do and do not affect measures of retrieval performance. We need to develop approaches to evaluation that are sensitive to these variations, and to human factors and individual differences more generally. Our approaches to evaluation must reflect the real world of real users. © 1996 John Wiley & Sons, Inc.
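To make the abstract's central concern concrete, the following is a minimal sketch (not from the article; document names and judgment sets are invented for illustration) of how the same ranked retrieval run can receive different effectiveness scores when scored against two assessors' divergent relevance judgments:

```python
def precision_at_k(ranked_docs, relevant, k):
    """Fraction of the top-k retrieved documents judged relevant."""
    return sum(1 for d in ranked_docs[:k] if d in relevant) / k

# One retrieval system's ranked output for a single query (hypothetical).
ranked = ["d1", "d2", "d3", "d4", "d5"]

# Two assessors judging the same documents against the same query.
assessor_a = {"d1", "d3", "d5"}  # judges three documents relevant
assessor_b = {"d1", "d2"}        # judges only two relevant

p_a = precision_at_k(ranked, assessor_a, 5)  # 3/5 = 0.6
p_b = precision_at_k(ranked, assessor_b, 5)  # 2/5 = 0.4
```

The system and its output are identical in both cases; only the relevance judgments differ, yet the measured precision changes from 0.6 to 0.4. Whether, and under what conditions, such disagreements also change *comparative* conclusions across systems is exactly the empirical question the article argues has gone unstudied.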
