Reverse Curriculum Generation for Reinforcement Learning

Many tasks of practical interest require an agent to reach a certain state, or to manipulate objects into a desired configuration. For example, we might want a robot to align and assemble a gear onto an axle, or to insert and turn a key in a lock. These goal-oriented tasks present a considerable challenge for reinforcement learning, since their natural reward function is sparse and prohibitive amounts of exploration are required to reach the goal and receive any learning signal. Past approaches tackle these problems by exploiting expert demonstrations or by manually designing a task-specific reward shaping function to guide the learning agent. Instead, we propose a method to learn these tasks without any prior knowledge beyond a single state in which the task is achieved. The robot is trained in reverse, gradually learning to reach the goal from a set of start states increasingly far from the goal. Our method automatically generates a curriculum of start states that adapts to the agent's performance, leading to efficient training on goal-oriented tasks. We demonstrate our approach on difficult simulated navigation and fine-grained manipulation problems that state-of-the-art reinforcement learning methods cannot solve.
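To make the procedure concrete, below is a minimal sketch of a reverse-curriculum loop on a toy 2-D point-mass task. The environment, the noisy goal-seeking stand-in for a learned policy, and all thresholds (`R_MIN`, `R_MAX`, walk lengths) are illustrative assumptions, not the paper's implementation; the full method would interleave an actual policy-gradient update (e.g. TRPO) with each curriculum iteration.

```python
# Sketch of reverse curriculum generation: start at the goal and expand a
# frontier of start states of intermediate difficulty for the current policy.
# All constants and the placeholder policy below are illustrative assumptions.
import math
import random

GOAL = (0.0, 0.0)
GOAL_RADIUS = 0.15       # success = ending within this distance of the goal
ACTION_SCALE = 0.1       # max displacement per step
HORIZON = 50             # rollout length when evaluating a start state
R_MIN, R_MAX = 0.1, 0.9  # keep starts that are neither hopeless nor mastered

def step(state, action):
    """Point-mass dynamics: clip the action and add it to the state."""
    ax = max(-ACTION_SCALE, min(ACTION_SCALE, action[0]))
    ay = max(-ACTION_SCALE, min(ACTION_SCALE, action[1]))
    return (state[0] + ax, state[1] + ay)

def at_goal(state):
    return math.dist(state, GOAL) < GOAL_RADIUS

def policy(state):
    """Stand-in for the learned policy: heads toward the goal, but flails
    randomly with probability growing in the distance, so that farther
    starts succeed less often (mimicking an imperfectly trained agent)."""
    d = math.dist(state, GOAL)
    if random.random() < min(1.0, 0.8 * d):
        return (random.uniform(-ACTION_SCALE, ACTION_SCALE),
                random.uniform(-ACTION_SCALE, ACTION_SCALE))
    dx, dy = GOAL[0] - state[0], GOAL[1] - state[1]
    norm = math.hypot(dx, dy) or 1.0
    return (ACTION_SCALE * dx / norm, ACTION_SCALE * dy / norm)

def success_rate(start, n_rollouts=20):
    """Estimate how often the current policy reaches the goal from `start`."""
    wins = 0
    for _ in range(n_rollouts):
        s = start
        for _ in range(HORIZON):
            s = step(s, policy(s))
            if at_goal(s):
                wins += 1
                break
    return wins / n_rollouts

def brownian_starts(seeds, n_new=30, walk_len=10):
    """Grow the start distribution outward: short random-action rollouts
    from current seeds yield nearby, slightly harder candidate starts."""
    candidates = []
    for _ in range(n_new):
        s = random.choice(seeds)
        for _ in range(walk_len):
            s = step(s, (random.uniform(-ACTION_SCALE, ACTION_SCALE),
                         random.uniform(-ACTION_SCALE, ACTION_SCALE)))
        candidates.append(s)
    return candidates

# Reverse curriculum loop: seed with the goal itself, then repeatedly expand.
seeds = [GOAL]
for it in range(5):
    candidates = brownian_starts(seeds)
    # In the full method, a policy optimization step would run here on
    # episodes initialized from `candidates`; this sketch only evaluates.
    good = [s for s in candidates if R_MIN <= success_rate(s) <= R_MAX]
    seeds = good or seeds  # fall back if no start sits at the frontier yet
    mean_d = sum(math.dist(s, GOAL) for s in seeds) / len(seeds)
    print(f"iter {it}: {len(good)} good starts, mean distance {mean_d:.2f}")
```

The filtering step is the heart of the idea: starts the policy already masters (success above `R_MAX`) or cannot yet reach at all (below `R_MIN`) yield little gradient signal, so only the frontier of intermediate-difficulty states is retained as seeds, and the start distribution creeps outward from the goal as competence grows.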
