Near-Optimal Regret Bounds for Thompson Sampling

Thompson Sampling (TS) is one of the oldest heuristics for multiarmed bandit problems. It is a randomized algorithm based on Bayesian ideas and has recently generated significant interest after several studies demonstrated that it has favorable empirical performance compared to state-of-the-art methods. In this article, a novel and almost tight martingale-based regret analysis for Thompson Sampling is presented. Our technique simultaneously yields both problem-dependent and problem-independent bounds: (1) the first near-optimal problem-independent bound of O(√(NT ln T)) on the expected regret, and (2) the optimal problem-dependent bound of (1 + ε) Σ_i ln T / d(μ_i, μ_1) + O(N/ε²) on the expected regret, where d(·, ·) denotes the Kullback-Leibler divergence (this bound was first proven by Kaufmann et al. (2012b)). Our technique is conceptually simple and extends easily to distributions other than the Beta distribution used in the original TS algorithm. For the version of TS that uses Gaussian priors, we prove a problem-independent bound of O(√(NT ln N)) on the expected regret and show that this bound is optimal by providing a matching lower bound. This is the first lower bound on the performance of a natural version of Thompson Sampling that is strictly larger than the general Ω(√(NT)) lower bound for the multiarmed bandit problem.
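
To make the algorithm concrete, the following is a minimal sketch of Thompson Sampling with the Beta priors referred to above, specialized to Bernoulli rewards. The function name, arm means, horizon, and seed are illustrative assumptions, not parameters taken from the paper.

# A minimal sketch (not the paper's pseudocode verbatim) of Thompson Sampling
# with Beta(1, 1) priors for Bernoulli-reward arms. Arm means and horizon are
# made-up values for illustration only.
import random

def thompson_sampling_bernoulli(arm_means, horizon, seed=0):
    rng = random.Random(seed)
    n_arms = len(arm_means)
    successes = [0] * n_arms  # observed rewards of 1 per arm
    failures = [0] * n_arms   # observed rewards of 0 per arm
    total_reward = 0.0
    for _ in range(horizon):
        # Draw one sample from each arm's Beta posterior and play the argmax.
        samples = [rng.betavariate(successes[i] + 1, failures[i] + 1)
                   for i in range(n_arms)]
        arm = max(range(n_arms), key=lambda i: samples[i])
        # Observe a Bernoulli reward and update that arm's posterior counts.
        reward = 1 if rng.random() < arm_means[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        total_reward += reward
    # Pseudo-regret of this run relative to always playing the best arm.
    return horizon * max(arm_means) - total_reward

if __name__ == "__main__":
    print(thompson_sampling_bernoulli([0.5, 0.45, 0.4], horizon=10000))

The Gaussian-prior variant discussed in the abstract keeps the same loop and replaces the Beta posterior draw with a sample from each arm's Gaussian posterior; only the sampling line changes.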

[1] Shipra Agrawal, et al. Near-Optimal Regret Bounds for Thompson Sampling, 2017.

[2] Sébastien Bubeck, et al. Prior-free and prior-dependent regret bounds for Thompson Sampling, 2013, 2014 48th Annual Conference on Information Sciences and Systems (CISS).

[3] Lihong Li, et al. Open Problem: Regret Bounds for Thompson Sampling, 2012, COLT.

[4] Peter Auer, et al. Finite-time Analysis of the Multiarmed Bandit Problem, 2002, Machine Learning.

[5] Shipra Agrawal, et al. Further Optimal Regret Bounds for Thompson Sampling, 2012, AISTATS.

[6] Benjamin Van Roy, et al. (More) Efficient Reinforcement Learning via Posterior Sampling, 2013, NIPS.

[7] Malcolm J. A. Strens, et al. A Bayesian Framework for Reinforcement Learning, 2000, ICML.

[8] Shipra Agrawal, et al. Analysis of Thompson Sampling for the Multi-armed Bandit Problem, 2011, COLT.

[9] Lihong Li, et al. Generalized Thompson Sampling for Contextual Bandits, 2013, ArXiv.

[10] Rémi Munos, et al. A Finite-Time Analysis of Multi-armed Bandits Problems with Kullback-Leibler Divergences, 2011, COLT.

[11] Jean-Yves Audibert, et al. Minimax Policies for Adversarial and Stochastic Bandits, 2009, COLT.

[12] Rémi Munos, et al. Spectral Thompson Sampling, 2014, AAAI.

[13] Benjamin Van Roy, et al. An Information-Theoretic Analysis of Thompson Sampling, 2014, J. Mach. Learn. Res.

[14] François Laviolette, et al. Domain-Adversarial Training of Neural Networks, 2015, J. Mach. Learn. Res.

[15] Aurélien Garivier, et al. The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond, 2011, COLT.

[16] Robin S. McDowell. Erratum: Handbook of mathematical functions with formulas, graphs, and mathematical tables (Nat. Bur. Standards, Washington, D.C., 1964), edited by Milton Abramowitz and Irene A. Stegun, 1973.

[17] T. L. Lai and Herbert Robbins. Asymptotically Efficient Adaptive Allocation Rules, 1985.

[18] Milton Abramowitz, et al. Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, 1964.

[19] Rémi Munos, et al. Thompson Sampling for 1-Dimensional Exponential Family Bandits, 2013, NIPS.

[20] Ole-Christoffer Granmo, et al. Solving two-armed Bernoulli bandit problems using a Bayesian learning automaton, 2010, Int. J. Intell. Comput. Cybern.

[21] Rémi Munos, et al. Thompson Sampling: An Optimal Finite Time Analysis, 2012, ArXiv.

[22] Steven L. Scott, et al. A modern Bayesian look at the multi-armed bandit, 2010.

[23] Emil Jerábek, et al. Dual weak pigeonhole principle, Boolean complexity, and derandomization, 2004, Annals of Pure and Applied Logic.

[24] Sébastien Bubeck, et al. Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, 2012, Found. Trends Mach. Learn.

[25] Jeremy Wyatt. Exploration and inference in learning from reinforcement, 1998.

[26] Rémi Munos, et al. Thompson Sampling: An Asymptotically Optimal Finite-Time Analysis, 2012, ALT.

[27] M. Abramowitz, et al. Handbook of Mathematical Functions With Formulas, Graphs and Mathematical Tables (National Bureau of Standards Applied Mathematics Series No. 55), 1965.

[28] Lihong Li, et al. An Empirical Evaluation of Thompson Sampling, 2011, NIPS.

[29] Benedict C. May. Simulation Studies in Optimistic Bayesian Sampling in Contextual-Bandit Problems, 2011.

[30] W. R. Thompson. On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples, 1933.

[31] Christian M. Ernst, et al. Multi-armed Bandit Allocation Indices, 1989.

[32] David S. Leslie, et al. Optimistic Bayesian Sampling in Contextual-Bandit Problems, 2012, J. Mach. Learn. Res.

[33] Shipra Agrawal, et al. Thompson Sampling for Contextual Bandits with Linear Payoffs, 2012, ICML.

[34] Benjamin Van Roy, et al. Learning to Optimize via Posterior Sampling, 2013, Math. Oper. Res.

[35] Joaquin Quiñonero Candela, et al. Web-Scale Bayesian Click-Through Rate Prediction for Sponsored Search Advertising in Microsoft's Bing Search Engine, 2010, ICML.

[36] J. Bather, et al. Multi-Armed Bandit Allocation Indices, 1990.

[37] Aurélien Garivier, et al. On Bayesian Upper Confidence Bounds for Bandit Problems, 2012, AISTATS.