Boltzmann Exploration Done Right
Nicolò Cesa-Bianchi | Claudio Gentile | Gábor Lugosi | Gergely Neu
[1] Richard S. Sutton, et al. Integrated Architectures for Learning, Planning, and Reacting Based on Approximating Dynamic Programming, 1990, ML.
[2] Nicolò Cesa-Bianchi, et al. Gambling in a rigged casino: The adversarial multi-armed bandit problem, 1995, Proceedings of IEEE 36th Annual Foundations of Computer Science.
[3] Andrew W. Moore, et al. Reinforcement Learning: A Survey, 1996, J. Artif. Intell. Res..
[4] Nicolò Cesa-Bianchi, et al. Finite-Time Regret Bounds for the Multiarmed Bandit Problem, 1998, ICML.
[5] Yishay Mansour, et al. Policy Gradient Methods for Reinforcement Learning with Function Approximation, 1999, NIPS.
[6] Peter Auer, et al. The Nonstochastic Multiarmed Bandit Problem, 2002, SIAM J. Comput..
[7] Doina Precup, et al. A Convergent Form of Approximate Policy Iteration, 2002, NIPS.
[8] Peter Auer, et al. Finite-time Analysis of the Multiarmed Bandit Problem, 2002, Machine Learning.
[9] Tommi S. Jaakkola, et al. Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms, 2000, Machine Learning.
[10] Mehryar Mohri, et al. Multi-armed Bandit Algorithms and Empirical Evaluation, 2005, ECML.
[11] Richard S. Sutton, et al. Reinforcement Learning: An Introduction, 1998, IEEE Trans. Neural Networks.
[12] Jean-Yves Audibert, et al. Minimax Policies for Bandits Games, 2009, COLT.
[13] Peter Auer, et al. UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem, 2010, Period. Math. Hung..
[14] Csaba Szepesvári, et al. Algorithms for Reinforcement Learning, 2010, Synthesis Lectures on Artificial Intelligence and Machine Learning.
[15] O. Catoni. Challenging the empirical mean and empirical variance: a deviation study, 2010, arXiv:1009.2048.
[16] Rémi Munos, et al. Thompson Sampling: An Asymptotically Optimal Finite-Time Analysis, 2012, ALT.
[17] Sébastien Bubeck, et al. Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, 2012, Found. Trends Mach. Learn..
[18] Sanjeev Arora, et al. The Multiplicative Weights Update Method: a Meta-Algorithm and Applications, 2012, Theory Comput..
[19] R. Munos, et al. Kullback–Leibler upper confidence bounds for optimal sequential allocation, 2012, arXiv:1210.1136.
[20] Shipra Agrawal, et al. Further Optimal Regret Bounds for Thompson Sampling, 2012, AISTATS.
[21] Ambuj Tewari, et al. Online Linear Optimization via Smoothing, 2014, COLT.
[22] Aleksandrs Slivkins, et al. One Practical Algorithm for Both Stochastic and Adversarial Bandits, 2014, ICML.
[23] Doina Precup, et al. Algorithms for multi-armed bandit problems, 2014, arXiv.
[24] Tor Lattimore, et al. Optimally Confident UCB: Improved Regret for Finite-Armed Bandits, 2015, arXiv.
[25] Benjamin Van Roy, et al. Generalization and Exploration via Randomized Value Functions, 2014, ICML.
[26] Tor Lattimore, et al. Regret Analysis of the Anytime Optimally Confident UCB Algorithm, 2016, arXiv.
[27] Tor Lattimore, et al. On Explore-Then-Commit strategies, 2016, NIPS.
[28] T. L. Lai and Herbert Robbins. Asymptotically Efficient Adaptive Allocation Rules, 1985, Advances in Applied Mathematics.