Learning to Explore with Meta-Policy Gradient

The performance of off-policy learning, including deep Q-learning and deep deterministic policy gradient (DDPG), critically depends on the choice of the exploration policy. Existing exploration methods are mostly based on adding noise to the current actor policy and can therefore only explore \emph{local} regions close to what the actor policy dictates. In this work, we develop a simple meta-policy gradient algorithm that adaptively learns the exploration policy in DDPG. Our algorithm trains flexible exploration behaviors that are independent of the actor policy, yielding a \emph{global exploration} that significantly speeds up the learning process. Through an extensive study, we show that our method significantly improves the sample efficiency of DDPG on a variety of reinforcement learning tasks.

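The abstract describes the approach only at a high level: the exploration policy is trained with a policy gradient whose reward is the improvement of the DDPG actor. The snippet below is a minimal, hypothetical sketch of that meta-exploration loop on a toy contextual-bandit task, with a least-squares critic and a linear deterministic actor standing in for full DDPG; the task, function names, and hyperparameters are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of meta-policy-gradient exploration (hypothetical toy example).
# An exploration policy with its own parameters collects the off-policy data;
# its "meta-reward" is the improvement in the actor's performance produced by
# one update on that data, and it is trained with REINFORCE on this signal.
import numpy as np

rng = np.random.default_rng(0)

# Toy task: contextual bandit with s ~ U(-1, 1) and reward -(a - 2s)^2.
def reward(s, a):
    return -(a - 2.0 * s) ** 2

def actor_return(w, n=256):
    """Monte-Carlo estimate of the deterministic actor's performance J(w)."""
    s = rng.uniform(-1.0, 1.0, n)
    return reward(s, w * s).mean()

def update_actor(w, s, a, r, lr=0.5):
    """One simplified 'DDPG-style' step: fit a quadratic critic by least
    squares, then follow the deterministic policy gradient of Q(s, w*s)."""
    X = np.stack([np.ones_like(s), s, a, s * a, a ** 2, s ** 2], axis=1)
    c = np.linalg.lstsq(X, r, rcond=None)[0]
    a_pi = w * s
    dq_da = c[2] + c[3] * s + 2.0 * c[4] * a_pi   # dQ/da at the actor's action
    return w + lr * np.mean(dq_da * s)            # chain rule: da/dw = s

# Exploration policy: a ~ N(w_e * s, exp(log_std)^2), parameters (w_e, log_std).
w, w_e, log_std = 0.0, 0.0, 0.0
meta_lr, baseline = 0.01, 0.0

for it in range(200):
    s = rng.uniform(-1.0, 1.0, 64)
    mu, std = w_e * s, np.exp(log_std)
    a = mu + std * rng.standard_normal(64)        # exploration rollout
    r = reward(s, a)

    j_before = actor_return(w)
    w = update_actor(w, s, a, r)                  # learner update on explored data
    meta_r = actor_return(w) - j_before           # meta-reward: actor improvement

    # REINFORCE on the exploration parameters, with a running baseline.
    adv = meta_r - baseline
    baseline = 0.9 * baseline + 0.1 * meta_r
    w_e += meta_lr * adv * np.sum((a - mu) / std ** 2 * s)
    log_std += meta_lr * adv * np.sum((a - mu) ** 2 / std ** 2 - 1.0)
    log_std = float(np.clip(log_std, -2.0, 1.0))  # keep the toy numerically tame

print(f"actor weight {w:.2f} (optimal 2.0), exploration std {np.exp(log_std):.2f}")
```

The essential structure is that the exploration parameters are updated only through the scalar meta-reward (the actor's before/after performance gap), so the exploration behavior is free to differ from the actor's own action distribution rather than being locked to noise around it.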