Automated curriculum generation for Policy Gradients from Demonstrations

In this paper, we present a technique that improves the sample efficiency of training an instruction-following agent with reinforcement learning (RL). We develop a training curriculum that uses a small number of expert demonstrations and trains the agent in a manner inspired by one of the ways humans learn complex tasks, i.e., by starting from the goal and working backwards. We evaluate our method on the BabyAI platform and show an improvement in sample efficiency on some of its tasks compared to a proximal policy optimization (PPO) baseline.
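The reverse-curriculum idea described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a toy 1D environment and a fixed stand-in policy (moving toward the goal with probability 0.8) in place of a learned PPO policy, and the function names, the success threshold, and the stage window size are hypothetical choices. Starts are first drawn from demonstration states near the goal, and the curriculum reaches further back toward the initial state as the success rate clears a threshold.

```python
import random


def reverse_curriculum_starts(demo_states, stage, window=2):
    """Candidate start states for the current curriculum stage.

    Stage 0 draws starts from the end of the demonstration (near the
    goal); higher stages reach further back toward the initial state.
    """
    end = len(demo_states) - 1
    lo = max(0, end - window * (stage + 1))
    hi = max(0, end - window * stage)
    return demo_states[lo:hi] or [demo_states[0]]


def run_episode(start, goal, max_steps=20):
    """Toy rollout on a 1D line: a stand-in for evaluating the current
    policy; a real setup would roll out the PPO agent in BabyAI."""
    pos = start
    for _ in range(max_steps):
        if pos == goal:
            return True
        if random.random() < 0.8:  # hypothetical fixed "policy"
            pos += 1 if goal > pos else -1
    return pos == goal

def train_with_reverse_curriculum(demo_states, goal, success_threshold=0.9,
                                  episodes_per_stage=50, seed=0):
    """Advance the curriculum stage (goal -> start) while the agent keeps
    solving the current stage; return the list of stages solved."""
    random.seed(seed)
    stage = 0
    solved_stages = []
    while True:
        starts = reverse_curriculum_starts(demo_states, stage)
        successes = sum(
            run_episode(random.choice(starts), goal)
            for _ in range(episodes_per_stage)
        )
        if successes / episodes_per_stage < success_threshold:
            break  # a real loop would keep training this stage with PPO
        solved_stages.append(stage)
        if demo_states[0] in starts:  # curriculum reached the initial state
            break
        stage += 1
    return solved_stages
```

For a demonstration visiting states 0 through 10 with the goal at 10, the curriculum begins with starts {8, 9} and walks back in windows of two until the initial state is included; the actual method replaces the fixed rollout with PPO updates at each stage.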
