Offline Evaluation without Gain

We propose a simple and flexible framework for offline evaluation based on a weak ordering of results (which we call "partial preferences") that defines a set of ideal rankings for a query. These partial preferences can be derived from side-by-side preference judgments, from graded judgments, from a combination of the two, or through other methods. We then measure the performance of a ranker by computing the maximum similarity between the actual ranking it generates for the query and the elements of this ideal result set. We call this measure the "compatibility" of the actual ranking with the ideal result set. We demonstrate that compatibility can replace and extend current offline evaluation measures that depend on fixed relevance grades mapped to gain values, such as NDCG. We examine a specific instance of compatibility based on rank biased overlap (RBO). We experimentally validate compatibility over multiple collections with different types of partial preferences, including very fine-grained preferences and partial preferences focused on the top ranks. As well as providing additional insights and flexibility, compatibility avoids shortcomings of both full preference judgments and traditional graded judgments.
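To make the idea concrete, below is a minimal illustrative sketch (in Python, not the paper's reference implementation) of compatibility as the maximum RBO between an actual ranking and the members of an explicitly enumerated ideal set. The function names, the persistence parameter p = 0.9, the truncation depth, and the omission of RBO extrapolation and normalization are all simplifying assumptions made here.

```python
# Sketch: compatibility as maximum truncated RBO over a set of ideal rankings.
from typing import Iterable, Sequence


def rbo(actual: Sequence[str], ideal: Sequence[str], p: float = 0.9) -> float:
    """Truncated rank-biased overlap between two rankings of document ids."""
    depth = max(len(actual), len(ideal))
    seen_actual: set = set()
    seen_ideal: set = set()
    weighted_overlap = 0.0
    for d in range(1, depth + 1):
        if d <= len(actual):
            seen_actual.add(actual[d - 1])
        if d <= len(ideal):
            seen_ideal.add(ideal[d - 1])
        # Agreement at depth d, discounted geometrically by p.
        weighted_overlap += (p ** (d - 1)) * len(seen_actual & seen_ideal) / d
    return (1.0 - p) * weighted_overlap


def compatibility(actual: Sequence[str],
                  ideal_rankings: Iterable[Sequence[str]],
                  p: float = 0.9) -> float:
    """Maximum similarity between the actual ranking and any ideal ranking."""
    return max(rbo(actual, ideal, p) for ideal in ideal_rankings)


if __name__ == "__main__":
    # Hypothetical example: two ideal rankings consistent with a tie between
    # documents B and C; the actual ranking matches one of them exactly.
    ideals = [["A", "B", "C"], ["A", "C", "B"]]
    print(compatibility(["A", "B", "C"], ideals))
    # ~0.271, the maximum attainable truncated RBO at depth 3 with p = 0.9.
```

In practice the set of ideal rankings consistent with a weak ordering can be very large, so a real implementation would maximize RBO directly over the partial preferences rather than enumerate the set as this sketch does.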
