Exploiting user disagreement for web search evaluation: an experimental approach

To express a more nuanced notion of relevance than binary judgments allow, graded relevance levels can be used in the evaluation of search results. Especially in Web search, users strongly prefer top results over less relevant ones, yet they often disagree on which results are the top ones for a given information need. Whereas previous work has generally regarded this disagreement as a negative effect, this paper proposes a method that exploits user disagreement by integrating it into the evaluation procedure. First, we present experiments that investigate user disagreement. We argue that, when disagreement is high, lower relevance levels may need to be promoted more strongly than when there is global consensus on the top results. We formalize this by introducing the User Disagreement Model, which yields a weighting of the relevance levels with a probabilistic interpretation. We provide a validity analysis and explain how to integrate the model with well-established evaluation metrics. Finally, we discuss a specific application of the model: estimating suitable weights for the combined relevance of Web search snippets and pages.
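As a rough illustration of the general idea (not the paper's actual User Disagreement Model), the sketch below assumes a scenario where each document has graded labels from several assessors. It estimates, for each relevance level, the empirical probability that another assessor would call a document of that level top-relevant, and then uses those probabilities as gains in a DCG-style metric. The function names, the weight estimator, and the toy data are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from math import log2

def level_weights(judgments, top_level):
    """Hypothetical estimator: weight each relevance level by the empirical
    probability that a different assessor rates a document of that level
    as top-relevant. `judgments` maps doc_id -> list of graded labels."""
    agree = defaultdict(int)   # level -> cross-assessor "top" votes
    total = defaultdict(int)   # level -> cross-assessor comparisons
    for labels in judgments.values():
        for i, level in enumerate(labels):
            for j, other in enumerate(labels):
                if i == j:
                    continue
                total[level] += 1
                if other == top_level:
                    agree[level] += 1
    return {lvl: agree[lvl] / total[lvl] for lvl in total if total[lvl] > 0}

def weighted_dcg(ranked_levels, weights, depth=10):
    """DCG over a ranked list of relevance levels, with the disagreement-derived
    per-level weights used as gains instead of a fixed gain such as 2^g - 1."""
    return sum(weights.get(lvl, 0.0) / log2(rank + 2)
               for rank, lvl in enumerate(ranked_levels[:depth]))

# Toy usage: three assessors judged each document on a 0-2 scale.
judgments = {"d1": [2, 2, 2], "d2": [2, 2, 1], "d3": [1, 0, 1], "d4": [0, 0, 0]}
w = level_weights(judgments, top_level=2)
print(w)                           # e.g. {2: 0.8, 1: 0.33, 0: 0.0}
print(weighted_dcg([2, 1, 0], w))  # DCG of a ranking with levels 2, 1, 0
```

When assessors largely agree, the weights concentrate on the top level; when they disagree, lower levels receive non-negligible weight, which is the qualitative behaviour the abstract describes.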
