Predicting relevance based on assessor disagreement: analysis and practical applications for search evaluation

Evaluation of search engines relies on assessments of search results for selected test queries, from which we would ideally like to draw conclusions about the relevance of the results for general (e.g., future, unknown) users. In practice, however, most evaluation scenarios only allow us to conclusively determine relevance for the particular assessor who provided the judgments. A factor that cannot be ignored when extending conclusions from assessors to users is the possible disagreement on relevance, given that a single gold-truth label does not exist. This paper presents and analyzes the predicted relevance model (PRM), which predicts a particular result’s relevance for a random user, based on an observed assessment and knowledge of the average disagreement between assessors. With the PRM, existing evaluation metrics designed for binary assessor relevance can be transformed into more robust, effectively graded measures that evaluate relevance for a random user. It also leads to a principled way of quantifying multiple graded or categorical relevance levels for use as gains in established graded relevance measures, such as normalized discounted cumulative gain, which nowadays often rely on heuristic, data-independent gain values. Given a set of test topics with graded relevance judgments, the PRM allows evaluating systems under different scenarios, such as their ability to retrieve top results or to filter out non-relevant ones. Its use in actual evaluation scenarios is illustrated on several information retrieval test collections.
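
The sketch below illustrates the general idea described in the abstract: mapping each observed assessor label to a predicted probability of relevance for a random user and using those probabilities as gains in nDCG. It is a minimal illustration under assumed inputs, not the paper's exact formulation; the label-to-probability mapping and its values are hypothetical placeholders that would, in practice, be estimated from overlapping judgments.

```python
# Minimal sketch of using PRM-style predicted relevance probabilities as nDCG gains.
# The probabilities below are illustrative assumptions, not estimates from the paper.

from math import log2

# Hypothetical statistics: P(random user finds the result relevant | observed label),
# which would be estimated from assessor disagreement on doubly-judged results.
predicted_relevance = {
    "non-relevant": 0.10,
    "relevant": 0.65,
    "highly relevant": 0.90,
}

def prm_gain(label: str) -> float:
    """Use the predicted probability of relevance for a random user as the gain,
    instead of a heuristic, data-independent gain value."""
    return predicted_relevance[label]

def dcg(labels) -> float:
    """Discounted cumulative gain with PRM-based gains and a log2 rank discount."""
    return sum(prm_gain(lab) / log2(rank + 2) for rank, lab in enumerate(labels))

def ndcg(labels) -> float:
    """Normalize by the DCG of the ideal ranking (results re-sorted by gain)."""
    ideal = sorted(labels, key=prm_gain, reverse=True)
    best = dcg(ideal)
    return dcg(labels) / best if best > 0 else 0.0

if __name__ == "__main__":
    ranking = ["relevant", "non-relevant", "highly relevant", "relevant"]
    print(f"PRM-based nDCG: {ndcg(ranking):.3f}")
```

Because the gains are probabilities of relevance for a random user rather than arbitrary grade weights, the resulting measure can be read as an expectation over users, which is what makes the transformation of binary metrics into graded ones principled.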
