Further Optimal Regret Bounds for Thompson Sampling

Thompson Sampling is one of the oldest heuristics for multi-armed bandit problems. It is a randomized algorithm based on Bayesian ideas, and has recently generated significant interest after several studies demonstrated that its empirical performance is comparable to or better than that of state-of-the-art methods. In this paper, we provide a novel regret analysis for Thompson Sampling that proves the first near-optimal problem-independent bound of $O(\sqrt{NT \ln T})$ on the expected regret of this algorithm. Our novel martingale-based analysis techniques are conceptually simple, and easily extend to distributions other than the Beta distribution. For the version of Thompson Sampling that uses Gaussian priors, we prove a problem-independent bound of $O(\sqrt{NT \ln N})$ on the expected regret, and demonstrate the optimality of this bound by providing a matching lower bound. This lower bound of $\Omega(\sqrt{NT \ln N})$ is the first lower bound on the performance of a natural version of Thompson Sampling that is away from the general lower bound of $\Omega(\sqrt{NT})$ for the multi-armed bandit problem. Our near-optimal problem-independent bounds for Thompson Sampling solve a COLT 2012 open problem of Chapelle and Li. Additionally, our techniques simultaneously provide the optimal problem-dependent bound of $(1+\epsilon) \sum_i \frac{\ln T}{d(\mu_i, \mu_1)} + O\left(\frac{N}{\epsilon^2}\right)$ on the expected regret. The optimal problem-dependent regret bound for this problem was first proven recently by Kaufmann et al. [2012b].

Appearing in Proceedings of the 16th International Conference on Artificial Intelligence and Statistics (AISTATS) 2013, Scottsdale, AZ, USA. Volume 31 of JMLR: W&CP 31. Copyright 2013 by the authors.
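To make the algorithm discussed in the abstract concrete, the following is a minimal sketch of Thompson Sampling for Bernoulli rewards with Beta posteriors. It is an illustration under stated assumptions, not the authors' code: the function name, its parameters (`true_means`, `horizon`, `seed`), the simulated environment, and the choice of a uniform Beta(1, 1) prior are all assumptions made for this example.

```python
import random


def thompson_sampling_bernoulli(true_means, horizon, seed=0):
    """Simulate Beta-posterior Thompson Sampling on Bernoulli arms.

    true_means: hypothetical success probability of each arm (unknown to the learner).
    horizon:    number of rounds T.
    Returns (pulls per arm, cumulative expected regret).
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    successes = [0] * n_arms  # number of 1-rewards observed per arm
    failures = [0] * n_arms   # number of 0-rewards observed per arm
    best_mean = max(true_means)
    regret = 0.0

    for _ in range(horizon):
        # Draw one sample per arm from its posterior Beta(S_i + 1, F_i + 1);
        # the "+ 1" terms encode the assumed uniform Beta(1, 1) prior.
        samples = [
            rng.betavariate(successes[i] + 1, failures[i] + 1)
            for i in range(n_arms)
        ]
        # Play the arm whose sampled mean is largest.
        arm = max(range(n_arms), key=samples.__getitem__)

        # Observe a simulated Bernoulli reward and update the posterior counts.
        if rng.random() < true_means[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1

        # Expected regret of this round: gap between the best arm and the played arm.
        regret += best_mean - true_means[arm]

    pulls = [successes[i] + failures[i] for i in range(n_arms)]
    return pulls, regret
```

Calling `thompson_sampling_bernoulli([0.2, 0.5, 0.8], 2000)` returns the number of times each arm was played together with the accumulated expected regret, which can be compared informally against the $O(\sqrt{NT \ln T})$ bound stated above. The Gaussian-prior variant mentioned in the abstract would replace the Beta draw with a Gaussian draw around each arm's empirical mean; it is not shown here.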

[1]  W. R. Thompson On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples , 1933 .

[2]  Milton Abramowitz,et al.  Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables , 1964 .

[3]  D. Owen Handbook of Mathematical Functions with Formulas , 1965 .

[4]  Christian M. Ernst,et al.  Multi-armed Bandit Allocation Indices , 1989 .

[5]  J. Bather,et al.  Multi‐Armed Bandit Allocation Indices , 1990 .

[6]  Jeremy Wyatt,et al.  Exploration and inference in learning from reinforcement , 1998 .

[7]  Malcolm J. A. Strens,et al.  A Bayesian Framework for Reinforcement Learning , 2000, ICML.

[8]  Costas Courcoubetis,et al.  Pricing Communication Networks: Economics, Technology and Modelling (Wiley Interscience Series in Systems and Optimization) , 2003 .

[9]  Peter Auer,et al.  Finite-time Analysis of the Multiarmed Bandit Problem , 2002, Machine Learning.

[10]  Emil Jeřábek,et al.  Dual weak pigeonhole principle, Boolean complexity, and derandomization , 2004, Annals of Pure and Applied Logic.

[11]  Jean-Yves Audibert,et al.  Minimax Policies for Adversarial and Stochastic Bandits , 2009, COLT.

[12]  Joaquin Quiñonero Candela,et al.  Web-Scale Bayesian Click-Through rate Prediction for Sponsored Search Advertising in Microsoft's Bing Search Engine , 2010, ICML.

[13]  Ole-Christoffer Granmo,et al.  Solving two-armed Bernoulli bandit problems using a Bayesian learning automaton , 2010, Int. J. Intell. Comput. Cybern.

[14]  Steven L. Scott,et al.  A modern Bayesian look at the multi-armed bandit , 2010 .

[15]  Benedict C. May Simulation Studies in Optimistic Bayesian Sampling in Contextual-Bandit Problems , 2011 .

[16]  Lihong Li,et al.  An Empirical Evaluation of Thompson Sampling , 2011, NIPS.

[17]  Rémi Munos,et al.  A Finite-Time Analysis of Multi-armed Bandits Problems with Kullback-Leibler Divergences , 2011, COLT.

[18]  Aurélien Garivier,et al.  The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond , 2011, COLT.

[19]  Rémi Munos,et al.  Thompson Sampling: An Optimal Finite Time Analysis , 2012, ArXiv.

[20]  Aurélien Garivier,et al.  On Bayesian Upper Confidence Bounds for Bandit Problems , 2012, AISTATS.

[21]  Sébastien Bubeck,et al.  Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems , 2012, Found. Trends Mach. Learn.

[22]  David S. Leslie,et al.  Optimistic Bayesian Sampling in Contextual-Bandit Problems , 2012, J. Mach. Learn. Res.

[23]  Lihong Li,et al.  Open Problem: Regret Bounds for Thompson Sampling , 2012, COLT.

[24]  Shipra Agrawal,et al.  Analysis of Thompson Sampling for the Multi-armed Bandit Problem , 2011, COLT.

[25]  Shipra Agrawal,et al.  Thompson Sampling for Contextual Bandits with Linear Payoffs , 2012, ICML.

[26]  T. L. Lai and Herbert Robbins  Asymptotically Efficient Adaptive Allocation Rules , 1985 .