Asymptotically optimal algorithms for budgeted multiple play bandits

We study a generalization of the multi-armed bandit problem with multiple plays where there is a cost associated with pulling each arm and the agent has a budget at each time that dictates how much she can expect to spend. We derive an asymptotic regret lower bound for any uniformly efficient algorithm in our setting. We then study a variant of Thompson sampling for Bernoulli rewards and a variant of KL-UCB for both single-parameter exponential families and bounded, finitely supported rewards. We show these algorithms are asymptotically optimal, both in rate and leading problem-dependent constants, including in the thick margin setting where multiple arms fall on the decision boundary.
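To make the setting concrete, here is a minimal sketch of a budgeted multiple-play Thompson sampling loop for Bernoulli rewards. It is an illustration under assumptions not stated in the abstract: costs are known and fixed, the budget is a hard per-round spending cap, and the per-round arm selection is the greedy reward-to-cost knapsack heuristic. The means `mu`, costs `cost`, and `budget` are hypothetical values, and this is not presented as the paper's exact algorithm or analysis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical problem instance: K arms with Bernoulli reward means `mu`,
# known pulling costs `cost`, and a per-round spending budget `budget`.
mu = np.array([0.9, 0.8, 0.7, 0.5, 0.3])
cost = np.array([1.0, 1.5, 1.0, 0.5, 0.5])
budget = 2.0
K, T = len(mu), 10_000

# Beta(1, 1) priors on each arm's reward mean.
successes = np.ones(K)
failures = np.ones(K)

def select_arms(theta, cost, budget):
    """Greedy knapsack selection: take arms in decreasing order of
    sampled reward-to-cost ratio while the budget allows."""
    order = np.argsort(-theta / cost)
    chosen, spent = [], 0.0
    for i in order:
        if spent + cost[i] <= budget:
            chosen.append(i)
            spent += cost[i]
    return chosen

total_reward = 0.0
for t in range(T):
    theta = rng.beta(successes, failures)    # one posterior sample per arm
    arms = select_arms(theta, cost, budget)  # budgeted multiple play
    for i in arms:
        r = rng.binomial(1, mu[i])           # observe Bernoulli reward
        successes[i] += r
        failures[i] += 1 - r
        total_reward += r

print(f"average per-round reward: {total_reward / T:.3f}")
```

With the sampled means fed through the greedy knapsack step, exploration is driven entirely by posterior randomization, which is the mechanism the abstract's Thompson sampling variant relies on; a KL-UCB-style variant would replace the posterior samples with upper confidence bounds on each arm's mean.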
