Assessors Agreement: A Case Study Across Assessor Type, Payment Levels, Query Variations and Relevance Dimensions

Relevance assessments are the cornerstone of Information Retrieval evaluation. Yet, there is only limited understanding of how assessment disagreement influences the reliability of the evaluation in terms of system rankings. In this paper we examine how assessor type (expert vs. layperson), payment level (paid vs. unpaid), query variations and relevance dimensions (topicality and understandability) influence system evaluation in the presence of disagreements across the assessments obtained in these different settings. The analysis is carried out in the context of the CLEF 2015 eHealth Task 2 collection and shows that disagreements between assessors belonging to the same group have little impact on evaluation. It also shows, however, that assessment disagreement found across settings has a major impact on evaluation when topical relevance is considered, while it has no impact when understandability assessments are considered.
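The kind of analysis described above is typically operationalised by measuring agreement between assessor groups (e.g. Cohen's kappa) and then comparing the system rankings induced by each group's judgements (e.g. Kendall's tau over per-system scores). The sketch below illustrates this general approach; it is not the paper's exact pipeline, and the toy judgements, run names, and the choice of rank-biased precision as the effectiveness measure are assumptions made purely for illustration.

```python
"""
Illustrative sketch: given two sets of relevance judgements over the same
document pool, measure (a) assessor agreement with Cohen's kappa and
(b) the impact of swapping judgement sets on system rankings via Kendall's
tau over per-system effectiveness scores. All data below is hypothetical.
"""
from scipy.stats import kendalltau
from sklearn.metrics import cohen_kappa_score


def rbp(ranked_doc_ids, qrels, p=0.8):
    """Rank-biased precision of one ranked list under binary judgements."""
    return (1 - p) * sum(
        qrels.get(doc, 0) * p ** i for i, doc in enumerate(ranked_doc_ids)
    )


# Hypothetical judgements from two assessor groups over the same pool.
qrels_expert = {"d1": 1, "d2": 0, "d3": 1, "d4": 0, "d5": 1}
qrels_layperson = {"d1": 1, "d2": 1, "d3": 0, "d4": 0, "d5": 1}

docs = sorted(qrels_expert)
kappa = cohen_kappa_score(
    [qrels_expert[d] for d in docs], [qrels_layperson[d] for d in docs]
)

# Hypothetical runs: each system returns a ranking of the pooled documents.
runs = {
    "sysA": ["d1", "d3", "d5", "d2", "d4"],
    "sysB": ["d2", "d1", "d4", "d3", "d5"],
    "sysC": ["d5", "d3", "d1", "d4", "d2"],
    "sysD": ["d4", "d2", "d1", "d5", "d3"],
}

scores_expert = [rbp(run, qrels_expert) for run in runs.values()]
scores_layperson = [rbp(run, qrels_layperson) for run in runs.values()]

# Tau close to 1 means the disagreement barely changes the system ordering;
# a low or negative tau means evaluation outcomes depend on who judged.
tau, _ = kendalltau(scores_expert, scores_layperson)
print(f"Cohen's kappa = {kappa:.3f}, Kendall's tau = {tau:.3f}")
```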
