An Information-Theoretic Analysis of Thompson Sampling

We provide an information-theoretic analysis of Thompson sampling that applies across a broad range of online optimization problems in which a decision-maker must learn from partial feedback. This analysis inherits the simplicity and elegance of information theory and leads to regret bounds that scale with the entropy of the optimal-action distribution. This strengthens preexisting results and yields new insight into how information improves performance.
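
The paper's headline bound, in its notation, has the form E[Regret(T)] <= sqrt(Gamma * H(A*) * T), where A* is the optimal action, H(A*) is the entropy of its prior distribution, and Gamma is a bound on the algorithm's information ratio (at most K/2 for a classical K-armed bandit). Below is a minimal sketch of Thompson sampling itself, applied to a Bernoulli K-armed bandit with independent Beta(1, 1) priors; the function name and setup are illustrative assumptions, not code from the paper.

    import numpy as np

    def thompson_sampling_bernoulli(true_means, horizon, seed=0):
        """Run Thompson sampling on a Bernoulli bandit; return cumulative expected regret."""
        rng = np.random.default_rng(seed)
        k = len(true_means)
        alpha = np.ones(k)  # Beta posterior parameter: 1 + observed successes per arm
        beta = np.ones(k)   # Beta posterior parameter: 1 + observed failures per arm
        best = max(true_means)
        regret = 0.0
        for _ in range(horizon):
            theta = rng.beta(alpha, beta)    # sample a mean reward for each arm from its posterior
            arm = int(np.argmax(theta))      # play the arm that is optimal under the sampled means
            reward = float(rng.random() < true_means[arm])
            alpha[arm] += reward             # conjugate Beta update on the played arm
            beta[arm] += 1.0 - reward
            regret += best - true_means[arm] # expected regret incurred by this pull
        return regret

For instance, thompson_sampling_bernoulli([0.3, 0.5, 0.7], horizon=10_000) should incur cumulative regret growing roughly as the square root of the horizon, consistent with the entropy-scaled bound above.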
