The End of Optimism? An Asymptotic Analysis of Finite-Armed Linear Bandits

Stochastic linear bandits are a natural and simple generalisation of finite-armed bandits with numerous practical applications. Current approaches focus on generalising existing techniques for finite-armed bandits, notably the optimism principle and Thompson sampling. While prior work has mostly been in the worst-case setting, we analyse the asymptotic instance-dependent regret and show matching upper and lower bounds on what is achievable. Surprisingly, our results show that no algorithm based on optimism or Thompson sampling achieves the optimal rate; indeed, such algorithms can be arbitrarily far from optimal, even in very simple cases. This is a disturbing result because these techniques are standard tools that are widely used for sequential optimisation, for example in generalised linear bandits and reinforcement learning.
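
To make the claimed failure mode concrete, below is a minimal simulation sketch of an optimistic (LinUCB-style) index rule on a two-dimensional instance in the spirit of the counterexample the abstract alludes to. The arm set, noise model, and the constants `eps`, `beta`, and the horizon are illustrative assumptions, not values taken from the paper. The intuition it illustrates: the large-gap arm (0, 1) is the most informative direction for separating the two near-parallel arms, but an optimistic rule quickly stops playing it because its estimated reward is clearly low, and so keeps paying for the small-gap arm instead.

```python
# Minimal sketch (not the paper's code) of an optimistic linear bandit
# algorithm on a 3-arm, 2-dimensional instance. Constants are assumptions.
import numpy as np

rng = np.random.default_rng(0)

theta = np.array([1.0, 0.0])        # unknown parameter vector
eps = 0.05                          # assumed gap parameter
arms = np.array([
    [1.0, 0.0],                     # optimal arm
    [0.0, 1.0],                     # large-gap but informative arm
    [1.0 - eps, 2.0 * eps],         # small-gap arm, hard to rule out
])
means = arms @ theta
gaps = means.max() - means

def linucb(n, beta=2.0, lam=1.0):
    """Generic optimistic index rule; beta is an assumed confidence width."""
    d = arms.shape[1]
    V = lam * np.eye(d)             # regularised Gram matrix
    b = np.zeros(d)                 # sum of reward-weighted features
    regret = 0.0
    pulls = np.zeros(len(arms), dtype=int)
    for _ in range(n):
        Vinv = np.linalg.inv(V)
        theta_hat = Vinv @ b
        # optimistic index: estimated reward plus confidence bonus
        bonus = np.sqrt(np.einsum('ij,jk,ik->i', arms, Vinv, arms))
        a = int(np.argmax(arms @ theta_hat + beta * bonus))
        x = arms[a]
        reward = x @ theta + rng.normal()   # unit-variance Gaussian noise
        V += np.outer(x, x)
        b += reward * x
        regret += gaps[a]
        pulls[a] += 1
    return regret, pulls

regret, pulls = linucb(10_000)
print(f"pseudo-regret: {regret:.1f}, pulls per arm: {pulls}")
```

Running this kind of simulation for small `eps` shows the optimistic rule allocating almost no pulls to the informative arm and instead paying a regret that grows with 1/eps on the small-gap arm, which is the qualitative behaviour the asymptotic lower bound rules out for an optimal allocation.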
