Improving Policy Gradient by Exploring Under-appreciated Rewards

This paper presents a novel form of policy gradient for model-free reinforcement learning (RL) with improved exploration properties. Current policy-based methods use entropy regularization to encourage undirected exploration of the reward landscape, which is ineffective in high-dimensional spaces with sparse rewards. We propose a more directed exploration strategy that promotes exploration of under-appreciated reward regions. An action sequence is considered under-appreciated if its log-probability under the current policy under-estimates its resulting reward. The proposed exploration strategy is easy to implement, requiring only small modifications to a standard implementation of the REINFORCE algorithm. We evaluate the approach on a set of algorithmic tasks that have long challenged RL methods. Our approach reduces hyper-parameter sensitivity and yields significant improvements over baseline methods. It solves a benchmark multi-digit addition task and generalizes to long sequences. This is, to our knowledge, the first time that a pure RL method has solved addition using only reward feedback.
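The abstract describes the method only at a high level: up-weight sampled action sequences whose log-probability under-estimates their reward, as a small modification to REINFORCE. The sketch below is a minimal illustration of that idea, not the authors' reference implementation; the temperature `tau`, the mean-reward baseline, and the self-normalized softmax weighting are assumptions about how such a reweighting might be realized.

```python
import numpy as np

def under_appreciated_reinforce_weights(rewards, log_probs, tau=0.1):
    """Hypothetical per-sample weights for a REINFORCE-style update.

    A sampled action sequence is treated as 'under-appreciated' when
    log pi(a) falls short of its scaled reward r(a)/tau, so samples
    with large r(a)/tau - log pi(a) receive extra weight.

    rewards:   (K,) total reward of each sampled action sequence
    log_probs: (K,) log-probability of each sequence under the policy
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    log_probs = np.asarray(log_probs, dtype=np.float64)

    # Standard REINFORCE term: reward centered by a batch-mean baseline.
    reinforce_w = rewards - rewards.mean()

    # Exploration term: self-normalized weights that are largest where
    # the policy's log-probability most under-estimates the reward.
    scores = rewards / tau - log_probs
    scores -= scores.max()                      # numerical stability
    explore_w = np.exp(scores) / np.exp(scores).sum()

    # Combined weight multiplying grad log pi(a^k) for each sample k.
    return reinforce_w / len(rewards) + tau * explore_w
```

Each returned weight would multiply the score function, the gradient of log pi(a^k), in the usual policy-gradient estimator; dropping the exploration term recovers plain REINFORCE with a mean baseline, which is consistent with the claim that the method needs only small modifications to an existing implementation.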
