On Bayesian index policies for sequential resource allocation

This paper studies index policies for minimizing (frequentist) regret in a stochastic multi-armed bandit model, inspired by a Bayesian view of the problem. Our main contribution is to prove that the Bayes-UCB algorithm, which relies on quantiles of posterior distributions, is asymptotically optimal when the reward distributions belong to a one-dimensional exponential family, for a large class of prior distributions. We also show that the Bayesian literature gives new insight into which exploration rates can be used in frequentist, UCB-type algorithms. Indeed, approximations of the Bayesian optimal solution or of the finite-horizon Gittins indices justify the kl-UCB+ and kl-UCB-H+ algorithms, whose asymptotic optimality is also established.
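
As a concrete illustration of the two index families named above, here is a minimal Python sketch, assuming Bernoulli rewards: the Bayes-UCB index is the quantile of order 1 - 1/(t (log t)^c) of the Beta posterior under a uniform prior, and the kl-UCB+ index is computed by bisection with the exploration rate log(t/N_a(t)). This is not code from the paper; the function names, the eps clipping, and the simulation loop are illustrative choices.

```python
import numpy as np
from scipy.stats import beta


def bayes_ucb_index(s, n, t, c=0):
    """Bayes-UCB index for a Bernoulli arm under a uniform Beta(1,1) prior:
    the quantile of order 1 - 1/(t (log t)^c) of the Beta(s+1, n-s+1)
    posterior. c = 0 is the usual practical tuning; the analysis allows c > 0."""
    level = 1.0 - 1.0 / (t * np.log(t) ** c)
    return beta.ppf(level, s + 1, n - s + 1)


def kl_bern(p, q):
    """Binary relative entropy kl(p, q), with endpoints clipped so that
    the 0 log 0 = 0 convention holds."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * np.log(p / q) + (1.0 - p) * np.log((1.0 - p) / (1.0 - q))


def kl_ucb_plus_index(s, n, t):
    """kl-UCB+ index: the largest q >= s/n such that n * kl(s/n, q) <= log(t/n),
    found by bisection. (Plain kl-UCB uses log t on the right-hand side.)"""
    mean = s / n
    budget = np.log(t / n) / n
    lo, hi = mean, 1.0
    for _ in range(50):  # 50 halvings are ample for float precision
        mid = 0.5 * (lo + hi)
        if kl_bern(mean, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo


def run(index_fn, means, horizon, seed=0):
    """Play a Bernoulli bandit with the given index policy; return the
    number of pulls of each arm."""
    rng = np.random.default_rng(seed)
    k = len(means)
    s, n = np.zeros(k), np.zeros(k)
    for t in range(1, horizon + 1):
        if t <= k:
            a = t - 1  # initialization: pull each arm once
        else:
            a = int(np.argmax([index_fn(s[i], n[i], t) for i in range(k)]))
        n[a] += 1
        s[a] += float(rng.random() < means[a])
    return n
```

For instance, run(bayes_ucb_index, [0.1, 0.2, 0.3], 10000) should concentrate most pulls on the last arm, and run(kl_ucb_plus_index, ...) behaves similarly; the horizon-dependent kl-UCB-H+ variant would be obtained by replacing log(t/n) with log(T/n) for a known horizon T.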
