Multileaved Comparisons for Fast Online Evaluation

Evaluation methods for information retrieval systems come in three types: offline evaluation, using static data sets annotated for relevance by human judges; user studies, usually conducted in a lab-based setting; and online evaluation, using implicit signals such as clicks from actual users. In online evaluation, preferences between rankers are typically inferred from implicit signals via interleaved comparison methods, which combine a pair of rankings and display the result to the user. We propose a new approach to online evaluation called multileaved comparisons that is useful in the prevalent case where designers are interested in the relative performance of more than two rankers. Rather than combining only a pair of rankings, multileaved comparisons combine an arbitrary number of rankings. The resulting user clicks then give feedback about how all these rankings compare to each other. We propose two specific multileaved comparison methods. The first, called team draft multileave, is an extension of team draft interleave. The second, called optimized multileave, is an extension of optimized interleave and is designed to handle cases where a large number of rankers must be multileaved. We present experimental results demonstrating that both team draft multileave and optimized multileave can accurately determine all pairwise preferences among a set of rankers using far less data than the interleaving methods that they extend.
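To make the mechanism concrete, the sketch below illustrates one plausible way to implement a team-draft-style multileave and the associated click credit assignment. It is a minimal sketch, assuming the usual round-based team draft scheme (in each round, the rankers that have contributed the fewest documents so far take turns in random order, each adding its highest-ranked document not yet shown); the function names `team_draft_multileave` and `credit_clicks` are illustrative only and are not taken from the paper or any library.

```python
import random
from collections import defaultdict

def team_draft_multileave(rankings, k):
    """Combine any number of ranked lists into one result list of length k.

    `rankings` is a list of ranked document lists, one per ranker.
    Returns the multileaved list and, for each slot, the index of the
    ranker (team) that contributed the document in that slot.
    """
    multileaved, teams = [], []
    contributed = [0] * len(rankings)      # documents contributed per ranker
    seen = set()
    active = set(range(len(rankings)))     # rankers that still have unseen docs
    while len(multileaved) < k and active:
        # Rankers that have contributed the fewest documents form this round,
        # visited in random order (the team draft tie-breaking idea).
        min_c = min(contributed[i] for i in active)
        round_rankers = [i for i in active if contributed[i] == min_c]
        random.shuffle(round_rankers)
        for i in round_rankers:
            if len(multileaved) >= k:
                break
            # The ranker adds its highest-ranked document not yet shown.
            doc = next((d for d in rankings[i] if d not in seen), None)
            if doc is None:
                active.discard(i)          # this ranker has nothing new left
                continue
            multileaved.append(doc)
            teams.append(i)
            seen.add(doc)
            contributed[i] += 1
    return multileaved, teams

def credit_clicks(teams, clicked_slots):
    """Credit each click to the ranker whose team contributed the clicked slot."""
    credits = defaultdict(int)
    for slot in clicked_slots:
        credits[teams[slot]] += 1
    return credits
```

Under this sketch, per-query credits would be aggregated over many impressions to estimate, for every pair of rankers, which one is preferred. How credits are normalized and how large ranker sets are handled efficiently are exactly the design questions the paper's team draft multileave and optimized multileave methods address.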
