Copeland Dueling Bandits

We address a version of the dueling bandit problem in which a Condorcet winner may not exist. We propose two algorithms that instead seek to minimize regret with respect to the Copeland winner, which, unlike the Condorcet winner, is guaranteed to exist. The first, Copeland Confidence Bound (CCB), is designed for small numbers of arms, while the second, Scalable Copeland Bandits (SCB), works better for large-scale problems. We provide theoretical results bounding the regret accumulated by CCB and SCB, both substantially improving existing results. Such existing results either offer bounds of the form O(K log T) but require restrictive assumptions, or offer bounds of the form O(K² log T) without requiring such assumptions. Our results offer the best of both worlds: O(K log T) bounds without restrictive assumptions.
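To illustrate the distinction the abstract draws: a Condorcet winner is an arm that beats every other arm in pairwise comparison, and may not exist; a Copeland winner is an arm that beats the largest number of other arms, and always exists. The sketch below (not from the paper; the 4-arm preference matrix is a made-up example) computes the Copeland winners from a matrix P where P[i, j] is the probability that arm i beats arm j.

```python
import numpy as np

# Hypothetical 4-arm preference matrix: P[i, j] = Pr(arm i beats arm j).
# No Condorcet winner exists here: every arm loses to at least one other arm.
P = np.array([
    [0.5, 0.6, 0.6, 0.4],
    [0.4, 0.5, 0.6, 0.6],
    [0.4, 0.4, 0.5, 0.6],
    [0.6, 0.4, 0.4, 0.5],
])

def copeland_winners(P):
    """Return (winners, scores): the Copeland score of arm i is the number
    of other arms it beats with probability > 1/2; the Copeland winners are
    the arms attaining the maximum score, and at least one always exists."""
    wins = (P > 0.5).sum(axis=1)              # pairwise victories per arm
    return np.flatnonzero(wins == wins.max()), wins

winners, scores = copeland_winners(P)
# Arms 0 and 1 each beat two others, so both are Copeland winners,
# even though neither beats all three opponents (no Condorcet winner).
```

Regret with respect to the Copeland winner, as minimized by CCB and SCB, is then measured against this maximal-score arm rather than against a (possibly nonexistent) arm that beats all others.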
