BaRC: Backward Reachability Curriculum for Robotic Reinforcement Learning

Model-free Reinforcement Learning (RL) offers an attractive approach to learning control policies for high-dimensional systems, but its relatively poor sample complexity often necessitates training in simulated environments. Even in simulation, goal-directed tasks whose natural reward function is sparse remain intractable for state-of-the-art model-free algorithms for continuous control. The bottleneck in these tasks is the prohibitive amount of exploration required to obtain a learning signal from the initial state of the system. In this work, we leverage physical priors in the form of an approximate system dynamics model to design a curriculum for a model-free policy optimization algorithm. Our Backward Reachability Curriculum (BaRC) begins policy training from states that require only a small number of actions to accomplish the task, and expands the initial state distribution backwards in a dynamically consistent manner once the policy optimization algorithm demonstrates sufficient performance. BaRC is general in that it can accelerate training of any model-free RL algorithm on a broad class of goal-directed continuous control MDPs. Its curriculum strategy is physically intuitive and easy to tune, and it incorporates physical priors to accelerate training without hindering the performance, flexibility, or applicability of the model-free RL algorithm. We evaluate our approach on two representative dynamic robotic learning problems and find substantial performance improvement relative to previous curriculum generation techniques and naive exploration strategies.
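
To make the curriculum loop described above concrete, the sketch below gives one plausible reading of it. It is a hedged illustration, not the authors' implementation: the callables `train_policy`, `sample_backward_reachable`, and `covers_initial_dist`, as well as the `success_threshold` and iteration-count values, are hypothetical placeholders standing in for the model-free learner (e.g. PPO), the backward-reachability expansion computed with the approximate dynamics model, and the stopping test.

```python
def barc_curriculum(train_policy,
                    sample_backward_reachable,
                    covers_initial_dist,
                    goal_states,
                    success_threshold=0.8,
                    max_outer_iters=50):
    """Grow the start-state curriculum backwards from the goal (sketch).

    Assumed interfaces (hypothetical, for illustration only):
      train_policy(policy, starts) -> (policy, success_rate):
          one round of model-free policy optimization with episodes
          initialized from `starts`.
      sample_backward_reachable(starts) -> list of states:
          samples states from the backward reachable set of `starts`,
          computed with the approximate dynamics model.
      covers_initial_dist(starts) -> bool:
          True once the curriculum reaches the task's true initial states.
    """
    policy = None
    starts = list(goal_states)          # start easy: states close to the goal
    for _ in range(max_outer_iters):
        policy, success_rate = train_policy(policy, starts)
        # Expand the curriculum only after the learner performs well enough
        # on the current start distribution.
        if success_rate >= success_threshold:
            starts = sample_backward_reachable(starts)
        if covers_initial_dist(starts):
            break
    return policy
```

Because the expansion step draws on backward reachable sets of the approximate model, each new batch of start states is dynamically consistent: every added state can actually reach the previously mastered region, so the learner always trains within reach of states where it already receives reward.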
