Competitive Experience Replay

Deep learning has achieved remarkable success in solving challenging reinforcement learning (RL) problems when a dense reward function is provided. In sparse-reward environments, however, it often suffers from the need to carefully shape the reward function to guide policy optimization. This limits the applicability of RL in the real world, since both reinforcement-learning expertise and domain-specific knowledge are required. It is therefore of great practical importance to develop algorithms that can learn from a binary signal indicating successful task completion, or from other unshaped, sparse reward signals. We propose a novel method, competitive experience replay, which efficiently supplements a sparse reward by placing learning in the context of an exploration competition between a pair of agents. Our method complements the recently proposed hindsight experience replay (HER) by inducing an automatic exploratory curriculum. We evaluate our approach on the tasks of reaching various goal locations in an ant maze and manipulating objects with a robotic arm. Each task provides only a binary reward indicating whether or not the goal is achieved. Our method asymmetrically augments these sparse rewards for a pair of agents each learning the same task, creating a competitive game designed to drive exploration. Extensive experiments demonstrate that this method leads to faster convergence and improved task performance.
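
To make the asymmetric reward augmentation concrete, the sketch below relabels the sparse rewards of two agents sampled from replay. The specific rule shown here is an assumption for illustration only, not the paper's exact formulation: agent A is penalized whenever one of its visited states lies near a state visited by agent B, while agent B receives a bonus for reaching states A has explored; the names `competitive_relabel`, `batch_a`, `batch_b`, and `threshold` are hypothetical.

```python
import numpy as np

def competitive_relabel(batch_a, batch_b, threshold=0.1):
    """Hypothetical sketch of competitive reward relabeling.

    `batch_a` and `batch_b` are dicts with 'states' (N x d arrays) and
    'rewards' (length-N arrays holding the original sparse, binary rewards)
    sampled from the replay buffers of the two competing agents.
    """
    states_a = batch_a["states"]
    states_b = batch_b["states"]

    # Pairwise distances between states visited by the two agents.
    dists = np.linalg.norm(states_a[:, None, :] - states_b[None, :, :], axis=-1)
    matched = dists < threshold  # N_a x N_b boolean matrix

    rewards_a = batch_a["rewards"].copy()
    rewards_b = batch_b["rewards"].copy()

    # Assumed asymmetric rule: A is discouraged from states B also reaches ...
    rewards_a -= matched.any(axis=1).astype(np.float32)
    # ... while B is encouraged to chase the states A has explored.
    rewards_b += matched.sum(axis=0).astype(np.float32)

    return rewards_a, rewards_b
```

Under this kind of rule, the relabeled batches would be fed to an off-policy learner (e.g. one built on experience replay), so the competition shapes exploration without altering the environment's own sparse reward.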
