An Information-Theoretic Analysis of Thompson Sampling

We provide an information-theoretic analysis of Thompson sampling that applies across a broad range of online optimization problems in which a decision-maker must learn from partial feedback. This analysis inherits the simplicity and elegance of information theory and leads to regret bounds that scale with the entropy of the optimal-action distribution. This strengthens preexisting results and yields new insight into how information improves performance.
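
The paper's headline bound, in its notation, has the form E[Regret(T)] <= sqrt(Gamma * H(A*) * T), where A* is the optimal action, H(A*) is the entropy of its prior distribution, and Gamma is a bound on the algorithm's information ratio (at most K/2 for a classical K-armed bandit). Below is a minimal sketch of Thompson sampling itself, applied to a Bernoulli K-armed bandit with independent Beta(1, 1) priors; the function name and setup are illustrative assumptions, not code from the paper.

    import numpy as np

    def thompson_sampling_bernoulli(true_means, horizon, seed=0):
        """Run Thompson sampling on a Bernoulli bandit; return cumulative expected regret."""
        rng = np.random.default_rng(seed)
        k = len(true_means)
        alpha = np.ones(k)  # Beta posterior parameter: 1 + observed successes per arm
        beta = np.ones(k)   # Beta posterior parameter: 1 + observed failures per arm
        best = max(true_means)
        regret = 0.0
        for _ in range(horizon):
            theta = rng.beta(alpha, beta)    # sample a mean reward for each arm from its posterior
            arm = int(np.argmax(theta))      # play the arm that is optimal under the sampled means
            reward = float(rng.random() < true_means[arm])
            alpha[arm] += reward             # conjugate Beta update on the played arm
            beta[arm] += 1.0 - reward
            regret += best - true_means[arm] # expected regret incurred by this pull
        return regret

For instance, thompson_sampling_bernoulli([0.3, 0.5, 0.7], horizon=10_000) should incur cumulative regret growing roughly as the square root of the horizon, consistent with the entropy-scaled bound above.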
