MergeRUCB: A Method for Large-Scale Online Ranker Evaluation

A key challenge in information retrieval is online ranker evaluation: determining, on the basis of user clicks on presented document lists, which of a finite set of rankers performs best in expectation. When the presented lists are constructed using interleaved comparison methods, which interleave lists proposed by two different candidate rankers, the problem of minimizing the total regret accumulated while evaluating the rankers can be formalized as a K-armed dueling bandit problem. In the setting of web search, the number of rankers under consideration may be large. Scaling effectively to so many rankers is a key challenge that existing algorithms do not adequately address. We propose a new method, MergeRUCB, that uses "localized" comparisons to provide the first provably scalable K-armed dueling bandit algorithm. Empirical comparisons on several large learning-to-rank datasets show that MergeRUCB can substantially outperform state-of-the-art K-armed dueling bandit algorithms when many rankers must be compared. Moreover, we provide theoretical guarantees demonstrating the soundness of our algorithm.
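To make the K-armed dueling bandit setting and the idea of batch-localized comparisons concrete, below is a minimal simulation sketch in Python. It is an illustration under simplifying assumptions, not the algorithm analyzed in this paper: the batch size, the upper-confidence elimination rule with exploration parameter alpha, the merging criterion, and the names duel and merge_dueling_bandits are all choices made here for brevity.

import math
import random

def duel(i, j, pref, rng):
    """Simulate one interleaved comparison: True iff ranker i beats ranker j.
    pref[i][j] is the (unknown) probability that i wins a duel against j."""
    return rng.random() < pref[i][j]

def merge_dueling_bandits(pref, batch_size=4, alpha=1.5, max_steps=200_000, seed=0):
    """Toy merge-style elimination for the K-armed dueling bandit problem."""
    rng = random.Random(seed)
    K = len(pref)
    wins = [[0] * K for _ in range(K)]  # wins[i][j]: number of times i beat j
    arms = list(range(K))
    rng.shuffle(arms)
    # Partition the rankers into small batches; every comparison is "localized"
    # to one batch, so each arm only ever duels a handful of others.
    batches = [arms[p:p + batch_size] for p in range(0, K, batch_size)]

    t = 0
    while t < max_steps and sum(len(b) for b in batches) > 1:
        for batch in batches:
            if len(batch) < 2:
                continue
            i, j = rng.sample(batch, 2)
            if duel(i, j, pref, rng):
                wins[i][j] += 1
            else:
                wins[j][i] += 1
            t += 1
            # Eliminate an arm once even an optimistic (upper-confidence)
            # estimate of its win rate against a batch-mate falls below 1/2.
            n = wins[i][j] + wins[j][i]
            bonus = math.sqrt(alpha * math.log(max(t, 2)) / n)
            if wins[i][j] / n + bonus < 0.5:
                batch.remove(i)
            elif wins[j][i] / n + bonus < 0.5:
                batch.remove(j)
        # Merge step: batches reduced to a single survivor are combined,
        # as in merge sort, so survivors go on to face fresh opponents.
        singles = [b for b in batches if len(b) == 1]
        rest = [b for b in batches if len(b) > 1]
        if len(singles) >= 2:
            rest.append([a for b in singles for a in b])
            batches = rest
        else:
            batches = rest + singles
    # Report the surviving arm with the most wins as the estimated best ranker.
    survivors = [a for b in batches for a in b]
    return max(survivors, key=lambda a: sum(wins[a]))

# Toy usage: 8 rankers with a strict quality ordering (lower index is better);
# the better ranker in any pair wins a duel with probability 0.6.
K = 8
pref = [[0.5 if i == j else (0.6 if i < j else 0.4) for j in range(K)]
        for i in range(K)]
print(merge_dueling_bandits(pref))  # typically prints 0

The point of the batching is that each ranker is only ever compared against a few batch-mates at a time, rather than against all K - 1 alternatives; this is the intuition behind the scalability claim above.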
