Gauging the Quality of Relevance Assessments using Inter-Rater Agreement

In recent years, gathering relevance judgments from assessors who did not originate the search topics has become an increasingly important problem in Information Retrieval. Relevance judgments are used to measure the effectiveness of retrieval systems, and are often needed to train supervised models in learning-to-rank systems. The two most popular approaches to gathering bronze-level judgments, where the judge is neither the originator of the information need being assessed nor a topic expert, are controlled user studies and crowdsourcing. However, judging comes at a cost in time and usually money, and the quality of the judgments can vary widely. In this work, we directly compare the reliability of judgments from three different bronze assessor groups. The first is a controlled Lab group; the second and third are two crowdsourcing groups: CF-Document, in which assessors were free to judge any number of documents for a topic, and CF-Topic, in which judges were required to judge all of the documents for a single topic, in a manner similar to the Lab group. Our study shows that Lab assessors exhibit a higher level of agreement with a set of ground-truth judgments than CF-Topic and CF-Document assessors, and inter-rater agreement rates show analogous trends. These findings suggest that, in the absence of ground-truth data, agreement between assessors can be used to reliably gauge the quality of relevance judgments gathered from secondary assessors, and that controlled user studies are more likely to produce reliable judgments despite being more costly.
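As a minimal illustration of the kind of inter-rater agreement measure the abstract refers to, the sketch below computes Cohen's kappa, a standard chance-corrected agreement statistic, between two assessors' binary relevance labels. The labels, variable names, and assessor pairing are hypothetical, for illustration only; they are not data or code from the study, which may use a different agreement statistic (e.g. Krippendorff's alpha) across more than two assessors.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two assessors' relevance labels."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of documents both assessors label identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each assessor's label marginals.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical binary relevance labels (1 = relevant) for ten documents on one topic.
lab_judge   = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
crowd_judge = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]
print(f"Cohen's kappa: {cohen_kappa(lab_judge, crowd_judge):.2f}")  # 0.40 here
```

A kappa near 0 indicates agreement no better than chance, while values approaching 1 indicate strong agreement; comparing such scores within and across assessor groups is one way to gauge judgment quality when no ground-truth labels are available.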
