GEP-PG: Decoupling Exploration and Exploitation in Deep Reinforcement Learning Algorithms

In continuous action domains, standard deep reinforcement learning algorithms such as DDPG suffer from inefficient exploration when facing sparse or deceptive reward problems. Conversely, evolutionary and developmental methods that focus on exploration, such as Novelty Search, Quality-Diversity, or Goal Exploration Processes, explore more robustly but are less efficient at fine-tuning policies with gradient descent. In this paper, we present GEP-PG, an approach that takes the best of both worlds by sequentially combining a Goal Exploration Process with two variants of DDPG. We study the learning performance of these components and of their combination on a low-dimensional deceptive reward problem and on the larger Half-Cheetah benchmark. We show that DDPG fails on the former and that GEP-PG improves over the best DDPG variant in both environments. Supplementary videos and discussion can be found at http://frama.link/gep_pg, and the code at this http URL.
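To make the sequential combination concrete, below is a minimal, self-contained Python sketch of the two phases on a toy 1D task. Everything here (the toy dynamics, the linear policy, the `gep_phase`/`ddpg_phase` names, and the hyper-parameters) is an illustrative assumption, not the authors' implementation; the key idea shown is that the phases communicate only through the replay buffer, which GEP fills with diverse transitions before DDPG trains on it unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

POLICY_DIM = 3   # linear policy with bias: action = w . state + b
HORIZON = 50

def rollout(theta):
    """Run one episode; return the behavioural outcome (final position)
    and the list of (state, action, reward, next_state) transitions."""
    state = np.zeros(2)                       # (position, velocity)
    transitions = []
    for _ in range(HORIZON):
        action = float(np.clip(theta[:2] @ state + theta[2], -1.0, 1.0))
        next_state = state + 0.1 * np.array([state[1], action])
        reward = -abs(next_state[0] - 1.0)    # toy shaped reward
        transitions.append((state, action, reward, next_state))
        state = next_state
    return state[0], transitions

def gep_phase(n_episodes=100, n_bootstrap=10, sigma=0.1):
    """Phase 1 (GEP): build an archive of (outcome, policy) pairs and a
    buffer of diverse transitions. Each new policy perturbs the archived
    policy whose outcome is closest to a randomly sampled goal."""
    archive, buffer = [], []
    for ep in range(n_episodes):
        if ep < n_bootstrap:
            theta = rng.uniform(-1.0, 1.0, POLICY_DIM)   # random bootstrap
        else:
            goal = rng.uniform(-2.0, 2.0)                # goal in outcome space
            _, theta_near = min(archive, key=lambda pair: abs(pair[0] - goal))
            theta = theta_near + sigma * rng.standard_normal(POLICY_DIM)
        outcome, transitions = rollout(theta)
        archive.append((outcome, theta))
        buffer.extend(transitions)
    return buffer

def ddpg_phase(replay_buffer):
    """Phase 2 (PG) placeholder: a real run would load these transitions
    into DDPG's replay buffer and then train actor and critic as usual."""
    print(f"DDPG would start from {len(replay_buffer)} GEP transitions")

ddpg_phase(gep_phase())
```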
