Evaluating Relevance Judgments with Pairwise Discriminative Power

Relevance judgments play an essential role in the evaluation of information retrieval systems. As many different relevance judgment settings have been proposed in recent years, a metric for comparing relevance judgments collected under different annotation settings has become a necessity. Traditional metrics, such as Cohen's κ, Krippendorff's α, and Φ, mainly focus on inter-assessor consistency to evaluate the quality of relevance judgments. They encounter the "reliable but useless" problem when employed to compare different annotation settings (e.g., binary judgment vs. 4-grade judgment). Meanwhile, other popular metrics such as discriminative power (DP) are not designed to compare relevance judgments across annotation settings and therefore suffer from limitations, such as requiring result ranking lists from many different systems. How to design an evaluation metric that compares relevance judgments under different grade settings thus needs further investigation. In this work, we propose a novel metric named pairwise discriminative power (PDP) to evaluate the quality of relevance judgment collections. By leveraging a small amount of document-level preference tests, PDP estimates how well a set of relevance judgments can separate ranking lists of varying quality. Through comprehensive experiments on both synthetic and real-world datasets, we show that PDP maintains a high degree of consistency with annotation quality across various grade settings. Compared with existing metrics (e.g., Krippendorff's α, Φ, and DP), it provides reliable evaluation results with affordable additional annotation effort.
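For context on the baseline the abstract contrasts against: classic discriminative power (DP) measures how often an effectiveness metric, computed over a relevance judgment collection, separates pairs of retrieval systems by a statistically significant margin, which is why it requires per-topic result lists from many systems. The sketch below is only an illustrative approximation of that idea, not the paper's PDP or its exact DP implementation; it uses a paired t-test in place of the bootstrap test of the original DP work, and the system names and per-topic scores are synthetic.

```python
# Illustrative sketch of classic discriminative power (DP):
# the fraction of system pairs whose effectiveness difference is
# statistically significant under a paired test.
from itertools import combinations

import numpy as np
from scipy import stats


def discriminative_power(per_topic_scores, alpha=0.05):
    """per_topic_scores: dict mapping system name -> array of per-topic
    metric values (e.g., nDCG per topic). Returns the fraction of system
    pairs judged significantly different by a paired two-sided t-test
    at significance level `alpha`."""
    systems = list(per_topic_scores)
    pairs = list(combinations(systems, 2))
    significant = 0
    for a, b in pairs:
        _, p_value = stats.ttest_rel(per_topic_scores[a], per_topic_scores[b])
        significant += p_value < alpha
    return significant / len(pairs)


# Synthetic per-topic nDCG scores for three hypothetical systems.
rng = np.random.default_rng(0)
scores = {
    "BM25": rng.normal(0.45, 0.10, 50),
    "LM":   rng.normal(0.47, 0.10, 50),
    "BERT": rng.normal(0.60, 0.10, 50),
}
print(discriminative_power(scores))
```

Because this quantity depends on having runs from many systems, it cannot by itself compare judgment collections gathered under different grade settings, which is the gap PDP is designed to fill.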
