Stochastic Activation Actor Critic Methods

Stochastic elements in reinforcement learning (RL), such as the stochastic weights of NoisyNets and the stochastic policies of the maximum entropy RL framework, have shown promise for improving exploration and the handling of uncertainty. Yet effective and general approaches for incorporating such elements into actor-critic models are still lacking. Inspired by these techniques, we propose an effective way to inject randomness into actor-critic models to improve general exploratory behavior and to reflect environment uncertainty. Specifically, randomness is added at the level of intermediate activations that feed into both the policy and value functions, yielding perturbations that are more strongly correlated and more complex. The proposed framework is also flexible and simple, allowing straightforward adaptation to a variety of tasks. We evaluate several actor-critic models enhanced with stochastic activations and demonstrate their effectiveness on a wide range of Atari 2600 games, a continuous control problem, and a car racing task. Lastly, in a qualitative analysis, we present evidence that the proposed model adapts the noise in the policy and value functions to reflect uncertainty and ambiguity in the environment.
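To make the mechanism concrete, the sketch below illustrates one way to add noise to a shared intermediate activation that feeds both the policy and value heads, in the spirit of the approach described above. It is a minimal PyTorch sketch, not the paper's implementation: the layer sizes, the learned per-unit noise scale log_sigma, and the simple additive Gaussian perturbation are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F


class StochasticActivationActorCritic(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Learned per-unit noise scale on the shared activation (an assumption;
        # the paper's exact noise parameterization may differ).
        self.log_sigma = nn.Parameter(torch.full((hidden,), -1.0))
        self.policy_head = nn.Linear(hidden, n_actions)
        self.value_head = nn.Linear(hidden, 1)

    def forward(self, obs, sample_noise=True):
        h = self.encoder(obs)
        if sample_noise:
            # The same noise realization perturbs the activation feeding both
            # heads, so the policy and value perturbations are correlated.
            h = h + torch.randn_like(h) * self.log_sigma.exp()
        logits = self.policy_head(h)
        value = self.value_head(h).squeeze(-1)
        return F.softmax(logits, dim=-1), value


# Usage: sample an action from the noise-perturbed policy.
net = StochasticActivationActorCritic(obs_dim=8, n_actions=4)
probs, value = net(torch.randn(1, 8))
action = torch.distributions.Categorical(probs=probs).sample()

Because the perturbation is applied once to the shared representation, both heads see correlated noise, and passing sample_noise=False recovers a deterministic forward pass for evaluation.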
