Advancements in Dueling Bandits

The dueling bandits problem is an online learning framework in which learning happens "on the fly" through preference feedback, i.e., from comparisons between pairs of actions. Unlike conventional online learning settings that require absolute feedback for each action, the dueling bandits framework assumes only (noisy) binary feedback about the relative quality of each pair of actions. It is therefore well suited to modeling settings that elicit subjective or implicit human feedback, which is typically more reliable in preference form. In this survey, we review recent results on the theory, algorithms, and applications of the dueling bandits problem. As an emerging area, dueling bandits have been studied intensively in recent years; we provide an overview of these advances, discuss extensions to the standard problem formulation and novel application areas, and highlight key open research questions throughout.
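To make the feedback model concrete, the following is a minimal, illustrative sketch of the dueling-bandit interaction protocol: at each round the learner selects a pair of arms and observes only a noisy binary outcome of their comparison, never an absolute reward. The arm count, preference matrix, horizon, and the simple optimism-based selection rule are hypothetical choices for illustration, not the method of any particular paper.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 4  # number of arms (hypothetical)
# P[i, j] = probability that arm i beats arm j in a duel; P[i, j] + P[j, i] = 1.
P = np.array([
    [0.5, 0.6, 0.7, 0.8],
    [0.4, 0.5, 0.6, 0.7],
    [0.3, 0.4, 0.5, 0.6],
    [0.2, 0.3, 0.4, 0.5],
])

wins = np.zeros((K, K))    # wins[i, j]   = number of times arm i beat arm j
trials = np.zeros((K, K))  # trials[i, j] = number of times the pair (i, j) was dueled

T = 10_000  # horizon (hypothetical)
for t in range(1, T + 1):
    # Optimistic estimates of the pairwise win probabilities
    # (a simple upper-confidence-bound rule, used here only for illustration).
    with np.errstate(divide="ignore", invalid="ignore"):
        means = np.where(trials > 0, wins / trials, 0.5)
        bonus = np.sqrt(2.0 * np.log(t) / np.maximum(trials, 1.0))
    ucb = np.clip(means + bonus, 0.0, 1.0)
    np.fill_diagonal(ucb, 0.5)

    # Pick an arm that optimistically beats every other arm,
    # then pair it with its strongest potential rival.
    candidates = np.where((ucb >= 0.5).all(axis=1))[0]
    i = int(rng.choice(candidates)) if len(candidates) else int(rng.integers(K))
    rival_scores = ucb[:, i].copy()
    rival_scores[i] = -np.inf  # exclude self-comparison
    j = int(np.argmax(rival_scores))

    # The environment reveals only the noisy comparison outcome.
    i_wins = rng.random() < P[i, j]
    wins[i, j] += i_wins
    wins[j, i] += 1 - i_wins
    trials[i, j] += 1
    trials[j, i] += 1

print("Most frequently dueled arm:", int(np.argmax(trials.sum(axis=1))))
```

With the preference matrix above, arm 0 beats every other arm with probability above one half, so an effective algorithm should concentrate its duels on arm 0 as the rounds progress.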
