On Thompson Sampling with Langevin Algorithms

Thompson sampling for multi-armed bandit problems is known to enjoy favorable performance in both theory and practice. However, it suffers from a significant computational limitation: it requires samples from the posterior distribution at every iteration. We propose two Markov chain Monte Carlo (MCMC) methods tailored to Thompson sampling to address this issue. We construct quickly converging Langevin algorithms that generate approximate posterior samples with accuracy guarantees, and we leverage novel posterior concentration rates to analyze the regret of the resulting approximate Thompson sampling algorithm. Further, we specify the MCMC hyperparameters needed to guarantee optimal instance-dependent frequentist regret at low computational cost. In particular, our algorithms exploit both posterior concentration and a sample-reuse mechanism to ensure that only a constant number of iterations and a constant amount of data are needed in each round. The resulting approximate Thompson sampling algorithm has logarithmic regret, and its per-round computational complexity does not scale with the time horizon.
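To make the procedure concrete, the sketch below shows approximate Thompson sampling for a Gaussian bandit in which each arm's posterior sample is produced by a few unadjusted Langevin algorithm (ULA) steps warm-started at the previous round's iterate. This is a minimal illustration under assumed conventions, not the paper's exact algorithm: the Gaussian prior, unit reward variance, step size, and number of inner steps are all illustrative choices, and the `langevin_sample` helper is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative bandit instance: K arms with unknown Gaussian mean rewards
# (unit variance assumed known).
true_means = np.array([0.3, 0.5, 0.9])
K, T = len(true_means), 2000

# Per-arm sufficient statistics and warm-start iterates (sample reuse).
counts = np.zeros(K)
sums = np.zeros(K)
theta = np.zeros(K)  # current Langevin iterate for each arm


def langevin_sample(arm, n_steps=20, prior_var=1.0):
    """Run a constant number of ULA steps on the arm's log-posterior.

    With a N(0, prior_var) prior and unit-variance Gaussian rewards, the
    gradient of the log-posterior at x is
        (sums[arm] - counts[arm] * x) - x / prior_var.
    Warm-starting at the previous round's iterate is what allows a
    constant number of steps per round.
    """
    x = theta[arm]
    # Step size ~ 1/L, where L is the smoothness of this log-posterior.
    step = 1.0 / (counts[arm] + 1.0 / prior_var)
    for _ in range(n_steps):
        grad = (sums[arm] - counts[arm] * x) - x / prior_var
        x = x + step * grad + np.sqrt(2.0 * step) * rng.normal()
    theta[arm] = x  # reuse as next round's initialization
    return x


for t in range(T):
    samples = [langevin_sample(a) for a in range(K)]
    a = int(np.argmax(samples))         # play the arm with the best sample
    r = true_means[a] + rng.normal()    # observe a noisy reward
    counts[a] += 1
    sums[a] += r
```

The sample reuse appears in the warm start `x = theta[arm]`: because the posterior concentrates as an arm accumulates data, successive posteriors are close, so a constant number of inner steps per round suffices to keep the iterate an accurate approximate posterior sample.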
