On Approximate Thompson Sampling with Langevin Algorithms

Thompson sampling for multi-armed bandit problems enjoys favorable performance in both theory and practice. However, its wider deployment is restricted by a significant computational limitation: the need for samples from posterior distributions at every iteration. In practice, this limitation is alleviated by approximate sampling methods, yet provably incorporating approximate samples into Thompson sampling algorithms remains an open problem. In this work we address this gap by proposing two efficient Langevin MCMC algorithms tailored to Thompson sampling. The resulting approximate Thompson sampling algorithms are efficiently implementable and provably achieve optimal instance-dependent regret for the multi-armed bandit (MAB) problem. To prove these results we derive novel posterior concentration bounds and MCMC convergence rates for log-concave distributions, which may be of independent interest.
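The paper's specific algorithms and step-size schedules live in the body and are not reproduced here. As a rough illustration of the general recipe the abstract describes, below is a minimal sketch of Thompson sampling in which each round's posterior sample is replaced by an iterate of the unadjusted Langevin algorithm (ULA), assuming a Gaussian bandit with a standard normal prior; all function names and parameters are illustrative, not the paper's.

```python
import numpy as np

def ula_sample(grad_log_post, smoothness, theta0=0.0, n_steps=100, rng=None):
    """Unadjusted Langevin Algorithm (ULA): approximate sample from a
    strongly log-concave posterior, given the gradient of its log-density.
    The step size is tied to the smoothness constant to keep the
    discretized dynamics stable as the posterior sharpens."""
    rng = rng if rng is not None else np.random.default_rng()
    step = 0.5 / smoothness
    theta = theta0
    for _ in range(n_steps):
        theta = (theta + step * grad_log_post(theta)
                 + np.sqrt(2.0 * step) * rng.standard_normal())
    return theta

def langevin_thompson(means, horizon=500, sigma=1.0, seed=0):
    """Approximate Thompson sampling for a Gaussian multi-armed bandit:
    N(0, 1) prior on each arm mean, rewards ~ N(mu_a, sigma^2), and each
    round's posterior sample drawn approximately by ULA."""
    rng = np.random.default_rng(seed)
    K = len(means)
    sums = np.zeros(K)    # cumulative reward per arm
    counts = np.zeros(K)  # pull count per arm
    for _ in range(horizon):
        samples = np.empty(K)
        for a in range(K):
            s, n = sums[a], counts[a]
            # Log-posterior gradient for arm a:
            # d/dθ [ -θ²/2 - Σ_i (r_i - θ)²/(2σ²) ] = -θ + (Σ_i r_i - n θ)/σ²
            grad = lambda th, s=s, n=n: -th + (s - n * th) / sigma**2
            samples[a] = ula_sample(grad, smoothness=1.0 + n / sigma**2, rng=rng)
        a = int(np.argmax(samples))  # play the arm with the best posterior sample
        sums[a] += means[a] + sigma * rng.standard_normal()
        counts[a] += 1
    return counts
```

Exact posterior sampling would of course be trivial in this conjugate Gaussian model; the sketch only illustrates the interface the abstract points at: Thompson sampling needs one posterior sample per arm per round, and a Langevin chain supplies an approximate one from gradient access alone.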
