Linear Thompson Sampling Revisited

We derive an alternative proof for the regret of Thompson sampling (\ts) in the stochastic linear bandit setting. While we obtain a regret bound of order $\wt{O}(d^{3/2}\sqrt{T})$ as in previous results, the proof sheds new light on the functioning of \ts. We leverage the structure of the problem to show how the regret is related to the sensitivity (i.e., the gradient) of the objective function and how selecting optimal arms associated with \textit{optimistic} parameters controls it. Thus we show that \ts can be seen as a generic randomized algorithm in which the sampling distribution is designed to have a fixed probability of being optimistic, at the cost of an additional $\sqrt{d}$ regret factor compared to a UCB-like approach. Furthermore, we show that our proof can be readily applied to regularized linear optimization and generalized linear model problems.
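For concreteness, below is a minimal sketch of linear Thompson sampling as described above, assuming a Gaussian perturbation of the regularized least-squares estimate whose covariance is the inverse design matrix. The names (`linear_ts`, `reward_fn`) and the inflation parameter `scale` (which, in the view above, would be chosen so that the sampled parameter is optimistic with a fixed probability) are illustrative assumptions, not the paper's exact specification.

```python
import numpy as np

def linear_ts(arms, T, reward_fn, reg=1.0, scale=1.0, seed=0):
    """Sketch of linear Thompson sampling.

    arms:      (n_arms, d) array of feature vectors.
    reward_fn: callable returning a noisy reward for a chosen arm.
    reg:       ridge-regularization parameter of the design matrix.
    scale:     inflation of the sampling covariance (controls optimism).
    """
    rng = np.random.default_rng(seed)
    d = arms.shape[1]
    V = reg * np.eye(d)   # regularized design matrix V_t
    b = np.zeros(d)       # sum of x_s * r_s

    for t in range(T):
        V_inv = np.linalg.inv(V)
        theta_hat = V_inv @ b                                   # RLS estimate
        # Sample a perturbed parameter; acting greedily w.r.t. it
        # is the randomized "optimistic" step.
        theta_tilde = rng.multivariate_normal(theta_hat, scale**2 * V_inv)
        x = arms[np.argmax(arms @ theta_tilde)]                 # optimal arm for sampled parameter
        r = reward_fn(x)
        V += np.outer(x, x)
        b += r * x

    return theta_hat
```

Acting greedily with respect to the sampled parameter is what ties the regret to the gradient of the objective: whenever `theta_tilde` happens to be optimistic, the chosen arm's value under the sampled model upper-bounds the optimal value, which is the event the sampling distribution is designed to hit with constant probability.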
