Measuring the Agreement Among Relevance Judges

The agreement (or disagreement) between relevance judges is a growing concern, for two reasons. First, new ways of expressing relevance judgments are in use: beyond the classical dichotomous judgment, various studies have added scalar and weighted judgments and orders of various kinds. Second, new media are being introduced: it is far quicker to judge the relevance of an image than that of a text, so human judgments can be obtained more easily. This paper presents a coherent account of the disagreement between relevance judges and groups of judges. Judgment expressions of different kinds, grouped into two categories, are taken into account. The first category, score judgments, comprises the classical dichotomous, scalar, and weighted judgments. The second, order judgments, comprises total (or linear) and partial (or weak) orders, both with or without equality. The paper proposes a uniform notation for describing relevance judgments of each kind; describes some of the problems that arise when one tries to measure the disagreement between judges operationally; proposes a measure for the disagreement of two judges expressing judgments of the same kind; discusses the disagreement of a group of more than two judges; and, finally, sketches some experimental activity inspired by this study.
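To make the two categories concrete, the following is a minimal sketch of how disagreement might be computed for each. It is not the measure proposed in the paper, which the abstract does not specify. The function names, the dictionary representation of a judgment, the choice of mean absolute difference for score judgments, the choice of pairwise discordance for order judgments, and the averaging over pairs of judges for a group are all assumptions made for illustration.

```python
from itertools import combinations


def score_disagreement(a, b, lo=0.0, hi=1.0):
    """Mean absolute difference between two score judgments, scaled to [0, 1].

    `a` and `b` map document ids to scores in [lo, hi] (assumed representation).
    Dichotomous judgments are the special case where every score is lo or hi.
    """
    if a.keys() != b.keys():
        raise ValueError("both judges must judge the same documents")
    if not a:
        return 0.0
    return sum(abs(a[d] - b[d]) for d in a) / ((hi - lo) * len(a))


def order_disagreement(rank_a, rank_b):
    """Fraction of document pairs that the two judges order differently.

    `rank_a` and `rank_b` map document ids to rank positions (lower means
    more relevant); equal positions denote equality. A pair tied for one
    judge but strictly ordered for the other counts as a disagreement.
    Incomparable pairs (true partial orders) are NOT handled in this sketch.
    """
    if rank_a.keys() != rank_b.keys():
        raise ValueError("both judges must order the same documents")
    pairs = list(combinations(sorted(rank_a), 2))
    if not pairs:
        return 0.0

    def sign(x):
        return (x > 0) - (x < 0)

    differing = sum(
        1
        for x, y in pairs
        if sign(rank_a[x] - rank_a[y]) != sign(rank_b[x] - rank_b[y])
    )
    return differing / len(pairs)


def group_disagreement(judgments, measure):
    """Average of `measure` over all pairs of judges (one possible choice)."""
    pairs = list(combinations(judgments, 2))
    if not pairs:
        return 0.0
    return sum(measure(a, b) for a, b in pairs) / len(pairs)


# Two judges scoring the same two documents on a dichotomous scale.
judge_1 = {"d1": 1, "d2": 0}
judge_2 = {"d1": 1, "d2": 1}
print(score_disagreement(judge_1, judge_2))  # 0.5
```

Both functions return 0 for identical judgments and 1 for maximally opposed ones, which makes values comparable across judgment kinds, although whether such a normalization is appropriate is one of the questions a study like this has to settle.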
