The Uncertainty Bellman Equation and Exploration
Ian Osband | Brendan O'Donoghue | Volodymyr Mnih | Rémi Munos
[1] Dimitri P. Bertsekas, et al. Dynamic Programming and Optimal Control, Two Volume Set, 1995.
[2] Benjamin Van Roy, et al. On Lower Bounds for Regret in Reinforcement Learning, 2016, ArXiv.
[3] Michael Kearns, et al. Near-Optimal Reinforcement Learning in Polynomial Time, 2002, Machine Learning.
[4] Marc G. Bellemare, et al. Count-Based Exploration with Neural Density Models, 2017, ICML.
[5] Malcolm J. A. Strens, et al. A Bayesian Framework for Reinforcement Learning, 2000, ICML.
[6] Tom Schaul, et al. Unifying Count-Based Exploration and Intrinsic Motivation, 2016, NIPS.
[7] Hilbert J. Kappen, et al. On the Sample Complexity of Reinforcement Learning with a Generative Model, 2012, ICML.
[8] Peter Auer, et al. Near-optimal Regret Bounds for Reinforcement Learning, 2008, J. Mach. Learn. Res..
[9] Benjamin Van Roy, et al. Generalization and Exploration via Randomized Value Functions, 2014, ICML.
[10] David K. Smith, et al. Dynamic Programming and Optimal Control. Volume 1, 1996.
[11] T. L. Lai, Herbert Robbins. Asymptotically Efficient Adaptive Allocation Rules, 1985.
[12] Benjamin Van Roy, et al. Deep Exploration via Bootstrapped DQN, 2016, NIPS.
[13] Marc G. Bellemare, et al. The Arcade Learning Environment: An Evaluation Platform for General Agents (Extended Abstract), 2012, IJCAI.
[14] Tor Lattimore, et al. PAC Bounds for Discounted MDPs, 2012, ALT.
[15] M. J. Sobel. The Variance of Discounted Markov Decision Processes, 1982.
[16] Aurélien Garivier, et al. On Bayesian Upper Confidence Bounds for Bandit Problems, 2012, AISTATS.
[17] Alex Graves, et al. Playing Atari with Deep Reinforcement Learning, 2013, ArXiv.
[18] Nuttapong Chentanez, et al. Intrinsically Motivated Reinforcement Learning, 2004, NIPS.
[19] Shie Mannor, et al. Learning the Variance of the Reward-To-Go, 2016, J. Mach. Learn. Res..
[20] Richard S. Sutton, et al. Learning to Predict by the Methods of Temporal Differences, 1988, Machine Learning.
[21] Rémi Munos, et al. From Bandits to Monte-Carlo Tree Search: The Optimistic Principle Applied to Optimization and Planning, 2014, Found. Trends Mach. Learn..
[22] Martha White, et al. Interval Estimation for Reinforcement-Learning Algorithms in Continuous-State Domains, 2010, NIPS.
[23] Peter Dayan, et al. Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search, 2012, NIPS.
[24] Sean R. Eddy, et al. What Is Dynamic Programming?, 2004, Nature Biotechnology.
[25] Alex Graves, et al. Asynchronous Methods for Deep Reinforcement Learning, 2016, ICML.
[26] John N. Tsitsiklis, et al. Mean-Variance Optimization in Markov Decision Processes, 2011, ICML.
[27] Zheng Wen, et al. Deep Exploration via Randomized Value Functions, 2017, J. Mach. Learn. Res..
[28] Marc G. Bellemare, et al. A Distributional Perspective on Reinforcement Learning, 2017, ICML.
[29] Catholijn M. Jonker, et al. Efficient Exploration with Double Uncertain Value Networks, 2017, ArXiv.
[30] Gábor Lugosi, et al. Prediction, Learning, and Games, 2006.
[31] Shane Legg, et al. Human-Level Control through Deep Reinforcement Learning, 2015, Nature.
[32] Sergey Levine, et al. Reinforcement Learning with Deep Energy-Based Policies, 2017, ICML.
[33] Jonathan P. How, et al. Sample Efficient Reinforcement Learning with Gaussian Processes, 2014, ICML.
[34] Tor Lattimore, et al. Regret Analysis of the Anytime Optimally Confident UCB Algorithm, 2016, ArXiv.
[35] W. R. Thompson. On the Likelihood That One Unknown Probability Exceeds Another in View of the Evidence of Two Samples, 1933.
[36] J. Berger. Statistical Decision Theory and Bayesian Analysis, 1988.
[37] Ronen I. Brafman, et al. R-MAX - A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning, 2001, J. Mach. Learn. Res..
[38] Ben J. A. Kröse, et al. Learning from Delayed Rewards, 1995, Robotics Auton. Syst..
[39] Jürgen Schmidhuber, et al. Driven by Compression Progress: A Simple Principle Explains Essential Aspects of Subjective Beauty, Novelty, Surprise, Interestingness, Attention, Curiosity, Creativity, Art, Science, Music, Jokes, 2008, ABiALS.
[40] Benjamin Van Roy, et al. (More) Efficient Reinforcement Learning via Posterior Sampling, 2013, NIPS.
[41] Benjamin Van Roy, et al. Why Is Posterior Sampling Better than Optimism for Reinforcement Learning?, 2016, ICML.
[42] Koray Kavukcuoglu, et al. Combining Policy Gradient and Q-learning, 2016, ICLR.
[43] R. Jackson. Inequalities, 2007, Algebra for Parents.
[44] John N. Tsitsiklis, et al. Bias and Variance Approximation in Value Function Estimates, 2007, Manag. Sci..
[45] Richard S. Sutton, et al. Reinforcement Learning: An Introduction, 1998, IEEE Trans. Neural Networks.
[46] Sham M. Kakade, et al. On the Sample Complexity of Reinforcement Learning, 2003.
[47] Robert H. Halstead, et al. Matrix Computations, 2011, Encyclopedia of Parallel Computing.
[48] Justin A. Boyan, et al. Least-Squares Temporal Difference Learning, 1999, ICML.