Thompson Sampling for 1-Dimensional Exponential Family Bandits

Thompson Sampling has been demonstrated in many complex bandit models, however the theoretical guarantees available for the parametric multi-armed bandit are still limited to the Bernoulli case. Here we extend them by proving asymptotic optimality of the algorithm using the Jeffreys prior for 1-dimensional exponential family bandits. Our proof builds on previous work, but also makes extensive use of closed forms for Kullback-Leibler divergence and Fisher information (through the Jeffreys prior) available in an exponential family. This allow us to give a finite time exponential concentration inequality for posterior distributions on exponential families that may be of interest in its own right. Moreover our analysis covers some distributions for which no optimistic algorithm has yet been proposed, including heavy-tailed exponential families.

[1]  Sébastien Bubeck,et al.  Prior-free and prior-dependent regret bounds for Thompson Sampling , 2013, 2014 48th Annual Conference on Information Sciences and Systems (CISS).

[2]  Sébastien Bubeck,et al.  A note on the Bayesian regret of Thompson Sampling with an arbitrary prior , 2013, ArXiv.

[3]  Benjamin Van Roy,et al.  Learning to Optimize via Posterior Sampling , 2013, Math. Oper. Res..

[4]  R. Munos,et al.  Kullback–Leibler upper confidence bounds for optimal sequential allocation , 2012, 1210.1136.

[5]  Shipra Agrawal,et al.  Further Optimal Regret Bounds for Thompson Sampling , 2012, AISTATS.

[6]  Shipra Agrawal,et al.  Thompson Sampling for Contextual Bandits with Linear Payoffs , 2012, ICML.

[7]  Rémi Munos,et al.  Thompson Sampling: An Asymptotically Optimal Finite-Time Analysis , 2012, ALT.

[8]  David S. Leslie,et al.  Optimistic Bayesian Sampling in Contextual-Bandit Problems , 2012, J. Mach. Learn. Res..

[9]  Aurélien Garivier,et al.  The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond , 2011, COLT.

[10]  Larry Wasserman,et al.  All of Statistics: A Concise Course in Statistical Inference , 2004 .

[11]  Peter Auer,et al.  Finite-time Analysis of the Multiarmed Bandit Problem , 2002, Machine Learning.

[12]  H. Robbins,et al.  Asymptotically efficient adaptive allocation rules , 1985 .

[13]  H. Jeffreys An invariant form for the prior probability in estimation problems , 1946, Proceedings of the Royal Society of London. Series A. Mathematical and Physical Sciences.

[14]  W. R. Thompson ON THE LIKELIHOOD THAT ONE UNKNOWN PROBABILITY EXCEEDS ANOTHER IN VIEW OF THE EVIDENCE OF TWO SAMPLES , 1933 .

[15]  Akimichi Takemura,et al.  An Asymptotically Optimal Bandit Algorithm for Bounded Support Models. , 2010, COLT 2010.

[16]  G. Lugosi Concentration Inequalities , 2008, COLT.

[17]  A. V. D. Vaart,et al.  Asymptotic Statistics: U -Statistics , 1998 .

[18]  25th Annual Conference on Learning Theory Analysis of Thompson Sampling for the Multi-armed Bandit Problem , 2022 .

[19]  T. L. Lai Andherbertrobbins Asymptotically Efficient Adaptive Allocation Rules , 2022 .