Deep Exploration via Bootstrapped DQN
Ian Osband | Charles Blundell | Alexander Pritzel | Benjamin Van Roy
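The paper's core idea is to drive deep exploration by training K bootstrapped Q-value heads on a shared network torso: one head is sampled at the start of each episode and followed greedily (a Thompson-sampling-style policy), and each transition receives a random bootstrap mask deciding which heads learn from it. Below is a minimal PyTorch sketch of that idea, not the paper's implementation: the layer sizes, K = 10, the mask probability, and the classic Gym-style env.reset()/env.step() interface are illustrative assumptions rather than the paper's Atari settings.

import random

import torch
import torch.nn as nn

class BootstrappedQNet(nn.Module):
    """K Q-value heads sharing one torso (sizes here are illustrative)."""
    def __init__(self, obs_dim: int, n_actions: int, n_heads: int = 10):
        super().__init__()
        self.torso = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU())
        self.heads = nn.ModuleList(
            nn.Linear(64, n_actions) for _ in range(n_heads)
        )

    def forward(self, obs: torch.Tensor, head: int) -> torch.Tensor:
        return self.heads[head](self.torso(obs))

def bootstrap_masks(batch_size: int, n_heads: int, p: float = 0.5) -> torch.Tensor:
    # Bernoulli(p) mask per (transition, head): head k trains on a
    # transition only if its mask bit is 1, approximating a bootstrap
    # resample of the replay data for each head.
    return torch.bernoulli(torch.full((batch_size, n_heads), p))

def masked_td_loss(net, target_net, batch, gamma=0.99):
    obs, act, rew, next_obs, done, masks = batch  # masks: [B, n_heads]
    loss = torch.zeros(())
    for k in range(len(net.heads)):
        q = net(obs, k).gather(1, act.unsqueeze(1)).squeeze(1)
        with torch.no_grad():  # per-head TD target from a frozen copy
            best_next = target_net(next_obs, k).max(dim=1).values
            target = rew + gamma * (1.0 - done) * best_next
        loss = loss + (masks[:, k] * (q - target) ** 2).mean()
    return loss

def run_episode(env, net, n_heads):
    # Sample one head for the whole episode and act greedily under it;
    # committing to a single sampled value function is what yields
    # temporally extended ("deep") exploration.
    head = random.randrange(n_heads)
    obs, done = env.reset(), False
    while not done:
        with torch.no_grad():
            q_values = net(torch.as_tensor(obs, dtype=torch.float32), head)
        obs, reward, done, info = env.step(q_values.argmax().item())

In the paper itself the shared torso is the DQN convnet and targets come from a periodically updated target network; the paper also reports that even sharing all data across heads (mask probability 1) retains much of the benefit, since head diversity already arises from random initialization.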
[1] W. R. Thompson. On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples, 1933.
[2] D. Freedman, et al. Some Asymptotic Theory for the Bootstrap, 1981.
[3] B. Efron. The Jackknife, the Bootstrap, and Other Resampling Plans, 1987.
[4] Raul Cano. On the Bayesian Bootstrap, 1992.
[5] S. T. Buckland, et al. An Introduction to the Bootstrap, 1994.
[6] Gerald Tesauro, et al. Temporal Difference Learning and TD-Gammon, 1995, J. Int. Comput. Games Assoc.
[7] Apostolos Burnetas, et al. Optimal Adaptive Policies for Markov Decision Processes, 1997, Math. Oper. Res.
[8] Richard S. Sutton, et al. Introduction to Reinforcement Learning, 1998.
[9] Malcolm J. A. Strens, et al. A Bayesian Framework for Reinforcement Learning, 2000, ICML.
[10] Ronen I. Brafman, et al. R-MAX - A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning, 2001, J. Mach. Learn. Res.
[11] Sham M. Kakade, et al. On the Sample Complexity of Reinforcement Learning, 2003.
[12] Michael Kearns, et al. Near-Optimal Reinforcement Learning in Polynomial Time, 2002, Machine Learning.
[13] Michael L. Littman, et al. A Theoretical Analysis of Model-Based Interval Estimation, 2005, ICML.
[14] Tao Wang, et al. Bayesian Sparse Sampling for On-line Reward Optimization, 2005, ICML.
[15] Richard S. Sutton, et al. Reinforcement Learning: An Introduction, 1998, IEEE Trans. Neural Networks.
[16] M. Kenward, et al. An Introduction to the Bootstrap, 2007.
[17] Peter Auer, et al. Near-optimal Regret Bounds for Reinforcement Learning, 2008, J. Mach. Learn. Res.
[18] Alex Graves, et al. Practical Variational Inference for Neural Networks, 2011, NIPS.
[19] Purnamrita Sarkar, et al. A Scalable Bootstrap for Massive Data, 2011, arXiv:1112.5016.
[20] Peter Dayan, et al. Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search, 2012, NIPS.
[21] Geoffrey E. Hinton, et al. ImageNet Classification with Deep Convolutional Neural Networks, 2012, Commun. ACM.
[22] A. Owen, et al. Bootstrapping Data Arrays of Arbitrary Order, 2011, arXiv:1106.2125.
[23] Zheng Wen, et al. Efficient Exploration and Value Function Generalization in Deterministic Systems, 2013, NIPS.
[24] Benjamin Van Roy, et al. (More) Efficient Reinforcement Learning via Posterior Sampling, 2013, NIPS.
[25] Benjamin Van Roy, et al. Near-optimal Reinforcement Learning in Factored MDPs, 2014, NIPS.
[26] Peter Dayan, et al. Bayes-Adaptive Simulation-based Search with Value Function Approximation, 2014, NIPS.
[27] Benjamin Van Roy, et al. Learning to Optimize via Posterior Sampling, 2013, Math. Oper. Res.
[28] Nitish Srivastava, et al. Dropout: A Simple Way to Prevent Neural Networks from Overfitting, 2014, J. Mach. Learn. Res.
[29] Rémi Munos, et al. From Bandits to Monte-Carlo Tree Search: The Optimistic Principle Applied to Optimization and Planning, 2014, Found. Trends Mach. Learn.
[30] Benjamin Van Roy, et al. Model-based Reinforcement Learning and the Eluder Dimension, 2014, NIPS.
[31] Sergey Levine, et al. Incentivizing Exploration in Reinforcement Learning with Deep Predictive Models, 2015, arXiv.
[32] Shie Mannor, et al. Thompson Sampling for Learning Parameterized Markov Decision Processes, 2014, COLT.
[33] Csaba Szepesvári, et al. Bayesian Optimal Control of Smoothly Parameterized Systems, 2015, UAI.
[34] Julien Cornebise, et al. Weight Uncertainty in Neural Networks, 2015, ICML.
[35] Ryan P. Adams, et al. Probabilistic Backpropagation for Scalable Learning of Bayesian Neural Networks, 2015, ICML.
[36] Benjamin Van Roy, et al. Bootstrapped Thompson Sampling and Deep Exploration, 2015, arXiv.
[37] Shane Legg, et al. Massively Parallel Methods for Deep Reinforcement Learning, 2015, arXiv.
[38] Julien Cornebise, et al. Weight Uncertainty in Neural Networks, 2015, arXiv.
[39] Shane Legg, et al. Human-level Control through Deep Reinforcement Learning, 2015, Nature.
[40] Diederik P. Kingma, et al. Variational Dropout and the Local Reparameterization Trick, 2015, NIPS.
[41] Christoph Dann, et al. Sample Complexity of Episodic Fixed-Horizon Reinforcement Learning, 2015, NIPS.
[42] Marc G. Bellemare, et al. The Arcade Learning Environment: An Evaluation Platform for General Agents, 2012, J. Artif. Intell. Res.
[43] Benjamin Van Roy, et al. Generalization and Exploration via Randomized Value Functions, 2014, ICML.
[44] David Silver, et al. Deep Reinforcement Learning with Double Q-Learning, 2015, AAAI.
[45] Tom Schaul, et al. Dueling Network Architectures for Deep Reinforcement Learning, 2015, ICML.
[46] Sergey Levine, et al. End-to-End Training of Deep Visuomotor Policies, 2015, J. Mach. Learn. Res.
[47] Tom Schaul, et al. Prioritized Experience Replay, 2015, ICLR.
[48] Zoubin Ghahramani, et al. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning, 2015, ICML.
[49] Yee Whye Teh, et al. Distributed Bayesian Learning with Stochastic Natural Gradient Expectation Propagation and the Posterior Server, 2015, J. Mach. Learn. Res.
[50] T. L. Lai and Herbert Robbins. Asymptotically Efficient Adaptive Allocation Rules, 1985.