Better than Their Reputation? On the Reliability of Relevance Assessments with Students

Over the last three years we conducted several information retrieval evaluation series with more than 180 LIS students who made relevance assessments on the output of three specific retrieval services. In this study we focus not on the retrieval performance of our system but on the relevance assessments and the inter-assessor reliability. To quantify the agreement we apply Fleiss' Kappa and Krippendorff's Alpha. Comparing the two statistical measures, the average Kappa value was 0.37 and the average Alpha value was 0.15. We use the two agreement measures to drop overly unreliable assessments from our data set. Computing the differences between the unfiltered and the filtered data sets yields a root mean square error between 0.02 and 0.12. We see this as a clear indicator that disagreement affects the reliability of retrieval evaluations. We suggest either not working with unfiltered results or clearly documenting the disagreement rates.
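As a minimal sketch of the two agreement measures named above, the following Python snippet computes Fleiss' Kappa and nominal Krippendorff's Alpha directly from their standard definitions. The toy ratings matrix, the binary relevance scale, and the function names are illustrative assumptions for exposition; they are not the study's data or tooling.

```python
# Illustrative sketch (not the authors' code): Fleiss' Kappa and nominal
# Krippendorff's Alpha for a made-up units x raters matrix of relevance grades.
import numpy as np

def fleiss_kappa(ratings: np.ndarray) -> float:
    """Fleiss' Kappa for a units x raters matrix of categorical labels."""
    units, raters = ratings.shape
    categories = np.unique(ratings)
    # counts[i, j] = number of raters who assigned category j to unit i
    counts = np.stack([(ratings == c).sum(axis=1) for c in categories], axis=1)
    p_j = counts.sum(axis=0) / (units * raters)           # category proportions
    P_i = (np.square(counts).sum(axis=1) - raters) / (raters * (raters - 1))
    P_bar, P_e = P_i.mean(), np.square(p_j).sum()
    return (P_bar - P_e) / (1 - P_e)

def krippendorff_alpha_nominal(ratings: np.ndarray) -> float:
    """Krippendorff's Alpha (nominal metric, complete data, no missing values)."""
    units, raters = ratings.shape
    categories = np.unique(ratings)
    counts = np.stack([(ratings == c).sum(axis=1) for c in categories], axis=1)
    n_c = counts.sum(axis=0)          # how often each category was used overall
    n = n_c.sum()                     # total number of pairable values
    # observed disagreement: ordered pairs of differing values within a unit
    matching = np.square(counts).sum(axis=1) - raters     # matching pairs per unit
    D_o = (raters * (raters - 1) - matching).sum() / (raters - 1) / n
    # expected disagreement from the overall distribution of values
    D_e = (n * n - np.square(n_c).sum()) / (n * (n - 1))
    return 1 - D_o / D_e

if __name__ == "__main__":
    # 6 documents judged by 4 assessors on a binary relevance scale (toy data)
    ratings = np.array([
        [1, 1, 1, 0],
        [0, 0, 0, 0],
        [1, 1, 0, 1],
        [0, 1, 0, 0],
        [1, 1, 1, 1],
        [0, 0, 1, 0],
    ])
    print(f"Fleiss' Kappa:        {fleiss_kappa(ratings):.3f}")
    print(f"Krippendorff's Alpha: {krippendorff_alpha_nominal(ratings):.3f}")
```

In a filtering step such as the one described above, topics or assessments whose Kappa or Alpha falls below a chosen threshold could then be removed before recomputing the retrieval metrics and the root mean square error against the unfiltered results.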
