Good Evaluation Measures based on Document Preferences

For offline evaluation of IR systems, some researchers have proposed to utilise pairwise document preference assessments instead of relevance assessments of individual documents, as it may be easier for assessors to make relative decisions rather than absolute ones. Simple preference-based evaluation measures such as ppref and wpref have been proposed, but such measures have not seen wide use over the past decade. One reason for this may be that, while these measures have been reported to behave more or less similarly to traditional measures based on absolute assessments, whether they actually align with users' perceptions of search engine result pages (SERPs) has remained unknown. The present study addresses exactly this question, after formally defining two classes of preference-based measures called Pref measures and Δ-measures. We show that the best of these measures perform at least as well as an average assessor in terms of agreement with users' SERP preferences, and that implicit document preferences (i.e., those suggested by a SERP that retrieves one document but not the other) play a much more important role than explicit preferences (i.e., those suggested by a SERP that retrieves one document above the other). We have released our data set containing 119,646 document preferences, so that the feasibility of document preference-based evaluation can be further pursued by the IR community.
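
To make the distinction between explicit and implicit preferences concrete, the sketch below scores a ranked list against a set of pairwise preferences in a simple ppref-style way. This is an illustrative simplification under assumed conventions, not the paper's exact Pref or Δ-measure definitions; the function name, the scoring rule, and the handling of unretrieved documents are assumptions made for the example.

```python
# Illustrative (hypothetical) ppref-style preference-agreement score.
# A preference (a, b) means "document a is preferred to document b".
# A preference counts as "explicit" when both documents are retrieved
# and a is ranked above b, and as "implicit" when a is retrieved but
# b is not. These conventions follow the abstract's informal wording,
# not the formal measure definitions in the paper.

from typing import Iterable, List, Tuple


def preference_agreement(ranking: List[str],
                         preferences: Iterable[Tuple[str, str]]) -> dict:
    """Count satisfied explicit/implicit preferences in a SERP and
    return a simple ratio of satisfied pairs to all judged pairs."""
    rank_of = {doc: i for i, doc in enumerate(ranking)}
    explicit = implicit = total = 0
    for preferred, other in preferences:
        total += 1
        if preferred in rank_of and other in rank_of:
            if rank_of[preferred] < rank_of[other]:
                explicit += 1   # both retrieved, in the preferred order
        elif preferred in rank_of:
            implicit += 1       # only the preferred document retrieved
    satisfied = explicit + implicit
    return {
        "explicit_satisfied": explicit,
        "implicit_satisfied": implicit,
        "score": satisfied / total if total else 0.0,
    }


if __name__ == "__main__":
    serp = ["d3", "d1", "d7"]
    prefs = [("d1", "d2"), ("d3", "d1"), ("d7", "d3")]
    # -> 1 explicit, 1 implicit satisfied out of 3 judged pairs (score 0.67)
    print(preference_agreement(serp, prefs))
```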
