On Fine-Grained Relevance Scales

In Information Retrieval evaluation, the classical approach of collecting binary relevance judgments has largely been replaced by multi-level relevance judgments and by gain-based metrics that exploit such multi-level scales. Recent work has also proposed and evaluated unbounded relevance scales based on Magnitude Estimation (ME) and compared them with multi-level scales. While ME brings advantages, such as allowing assessors to always judge the next document as more or less relevant than any document they have judged so far, it also has drawbacks: for example, it is not a natural way for human assessors to judge items, unlike the bounded rating scales they are accustomed to on the Web (e.g., 5-star ratings). In this work, we propose and experimentally evaluate a bounded, fine-grained relevance scale that retains many of the advantages of ME while addressing some of its issues. We collect relevance judgments on a 100-level scale (S100) by means of a large-scale crowdsourcing experiment and compare the results with other relevance scales (binary, 4-level, and ME), showing the benefits of fine-grained scales over both coarse-grained and unbounded scales and highlighting some new findings about ME. Our results show that S100 maintains the flexibility of unbounded scales like ME, giving assessors ample choice when judging document relevance (i.e., they can fit new judgments between previously given ones). It also lets assessors judge on a more familiar scale (e.g., on 10 levels) and work efficiently from the very first judging task.
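As a concrete illustration of how a fine-grained scale relates to coarser ones and to gain-based evaluation, the following is a minimal sketch (not taken from the paper): it assumes S100 judgments are collapsed onto equally wide bins to obtain binary or 4-level labels, and then used as gains in a standard DCG computation. The binning rule and function names are illustrative assumptions, not the authors' actual mapping.

```python
import math

def s100_to_levels(score, levels=4):
    """Map a 0-100 relevance judgment onto `levels` equally wide bins (0 .. levels-1).

    This equal-width binning is an assumption for illustration only.
    """
    return min(levels - 1, score * levels // 101)

def dcg(gains):
    """Discounted cumulative gain over a ranked list of gain values."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))

# Hypothetical S100 judgments for one ranked list of documents.
s100_judgments = [92, 40, 73, 5, 0]

binary = [s100_to_levels(s, levels=2) for s in s100_judgments]   # [1, 0, 1, 0, 0]
graded4 = [s100_to_levels(s, levels=4) for s in s100_judgments]  # [3, 1, 2, 0, 0]

print(dcg(graded4), dcg(binary))
```

The same fine-grained judgments can thus feed evaluations at any coarser granularity, which is the kind of comparison (binary vs. 4-level vs. S100 vs. ME) the experiment reports.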
