Parallelizing Thompson Sampling

How can we exploit information parallelism in online decision-making problems while efficiently balancing the exploration-exploitation trade-off? In this paper, we introduce a batch Thompson Sampling framework for two canonical online decision-making problems: stochastic multi-armed bandits and linear contextual bandits with finitely many arms. Over a time horizon T, our batch Thompson Sampling policy achieves the same (asymptotic) regret bound as a fully sequential one while issuing only O(log T) batch queries. To achieve this exponential reduction, i.e., reducing the number of interactions from T to O(log T), our batch policy dynamically determines the duration of each batch so as to balance the exploration-exploitation trade-off. We also demonstrate experimentally that dynamic batch allocation dramatically outperforms natural baselines such as static batch allocation.
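The abstract leaves the batching rule implicit, so the following minimal Python sketch illustrates one plausible instantiation for Bernoulli multi-armed bandits: within a batch, arms are chosen by sampling from the Beta posterior frozen at the batch start, and a new batch query (posterior update) is issued once some arm's pull count doubles. The doubling-style trigger, the Bernoulli/Beta setting, and the names `batch_thompson_sampling` and `reward_fn` are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def batch_thompson_sampling(reward_fn, n_arms, horizon, rng=None):
    """Sketch of batch Thompson Sampling for Bernoulli bandits.

    Within a batch, arms are drawn from the posterior frozen at the
    batch start; observed rewards are buffered and only incorporated
    when the batch ends. Here a batch ends when some arm's pull count
    doubles -- a doubling-style rule assumed for illustration only.
    """
    rng = rng or np.random.default_rng()
    alpha = np.ones(n_arms)    # Beta posterior: successes + 1
    beta = np.ones(n_arms)     # Beta posterior: failures + 1
    pulls = np.zeros(n_arms)   # total pulls per arm
    batches = 0
    t = 0
    while t < horizon:
        batches += 1
        snap_a, snap_b = alpha.copy(), beta.copy()  # freeze posterior
        start_pulls = np.maximum(pulls, 1.0)        # counts at batch start
        pending = []                                # (arm, reward) buffer
        # Act on the frozen posterior until some arm's pull count doubles.
        while t < horizon and np.all(pulls < 2 * start_pulls):
            theta = rng.beta(snap_a, snap_b)        # one posterior sample per arm
            arm = int(np.argmax(theta))
            pending.append((arm, reward_fn(arm)))
            pulls[arm] += 1
            t += 1
        # Batch query: update the posterior with all buffered rewards at once.
        for arm, r in pending:
            alpha[arm] += r
            beta[arm] += 1 - r
    return alpha, beta, batches
```

For instance, with true means `means = [0.3, 0.5, 0.7]` one can pass `reward_fn=lambda a: int(np.random.default_rng().binomial(1, means[a]))`; because the doubling trigger lengthens batches geometrically, the returned `batches` count grows roughly logarithmically in `horizon`, mirroring the O(log T) batch complexity claimed above.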
