Using graded relevance assessments in IR evaluation

This article proposes evaluation methods based on the use of nondichotomous relevance judgements in IR experiments. It is argued that evaluation methods should credit IR methods for their ability to retrieve highly relevant documents, which is desirable from the user's point of view in modern, large IR environments. The proposed methods are (1) a novel application of P-R curves and average precision computations based on separate recall bases for documents of different degrees of relevance, and (2) generalized recall and precision based directly on multigrade relevance assessments (i.e., without dichotomizing the assessments). We demonstrate the use of the traditional and the novel evaluation measures in a case study on the effectiveness of query types, based on combinations of query structures and expansion, in retrieving documents of various degrees of relevance. The test was run with a best-match retrieval system (InQuery) on a text database consisting of newspaper articles. To gain insight into the retrieval process, one should use both graded relevance assessments and effectiveness measures that make it possible to observe the differences, if any, between retrieval methods in retrieving documents of different levels of relevance. In modern times of information overload, one should pay particular attention to the capability of retrieval methods to retrieve highly relevant documents.
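
The sketch below is a minimal illustration (not the authors' implementation) of the two proposed measures. It assumes an integer relevance scale 0-3, an example weighting of 0, 1/3, 2/3, 1 for mapping grades to [0, 1] scores, and that a separate recall base contains the documents judged at exactly a given level; the function and variable names are hypothetical.

```python
# Illustrative sketch of (1) average precision over a separate recall base per
# relevance level and (2) generalized precision/recall over graded assessments.
# Grade scale (0-3) and weights are assumptions, not prescribed by the article.
from typing import Dict, List, Tuple


def average_precision_for_level(ranked: List[str],
                                judgements: Dict[str, int],
                                level: int) -> float:
    """Non-interpolated average precision against a separate recall base:
    only documents judged at exactly `level` count as relevant here."""
    recall_base = {d for d, g in judgements.items() if g == level}
    if not recall_base:
        return 0.0
    hits, ap_sum = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in recall_base:
            hits += 1
            ap_sum += hits / rank          # precision at this relevant document
    return ap_sum / len(recall_base)       # unretrieved relevant docs contribute 0


def generalized_precision_recall(ranked: List[str],
                                 judgements: Dict[str, int],
                                 weights: Dict[int, float]) -> Tuple[float, float]:
    """Generalized precision and recall: graded assessments are summed
    directly instead of being dichotomized into relevant/non-relevant."""
    gained = sum(weights[judgements.get(doc, 0)] for doc in ranked)
    total = sum(weights[g] for g in judgements.values())
    g_precision = gained / len(ranked) if ranked else 0.0
    g_recall = gained / total if total else 0.0
    return g_precision, g_recall


if __name__ == "__main__":
    # Toy example: six judged documents, five retrieved in this order.
    judgements = {"d1": 3, "d2": 1, "d3": 0, "d4": 2, "d5": 3, "d6": 1}
    ranked = ["d1", "d3", "d4", "d2", "d5"]
    weights = {0: 0.0, 1: 1 / 3, 2: 2 / 3, 3: 1.0}

    print(average_precision_for_level(ranked, judgements, level=3))
    print(generalized_precision_recall(ranked, judgements, weights))
```

Note that the grade-to-weight mapping is a free parameter of the generalized measures: different weightings emphasize highly relevant documents to different degrees, which is exactly the kind of choice the evaluation is meant to expose.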
