Making Efficient Use of Demonstrations to Solve Hard Exploration Problems

This paper introduces R2D3, an agent that makes efficient use of demonstrations to solve hard exploration problems in partially observable environments with highly variable initial conditions. We also introduce a suite of eight tasks that combine these three properties, and show that R2D3 can solve several of the tasks where other state of the art methods (both with and without demonstrations) fail to see even a single successful trajectory after tens of billions of steps of exploration.

[1]  Dean Pomerleau,et al.  ALVINN, an autonomous land vehicle in a neural network , 2015 .

[2]  Rémi Munos,et al.  Observe and Look Further: Achieving Consistent Performance on Atari , 2018, ArXiv.

[3]  Rouhollah Rahmatizadeh,et al.  Vision-Based Multi-Task Manipulation for Inexpensive Robots Using End-to-End Learning from Demonstration , 2017, 2018 IEEE International Conference on Robotics and Automation (ICRA).

[4]  Andrew Y. Ng,et al.  Policy Invariance Under Reward Transformations: Theory and Application to Reward Shaping , 1999, ICML.

[5]  Stefano Ermon,et al.  Generative Adversarial Imitation Learning , 2016, NIPS.

[6]  Sergey Levine,et al.  Divide-and-Conquer Reinforcement Learning , 2017, ICLR.

[7]  Marc G. Bellemare,et al.  The Arcade Learning Environment: An Evaluation Platform for General Agents , 2012, J. Artif. Intell. Res..

[8]  Marc G. Bellemare,et al.  Investigating Contingency Awareness Using Atari 2600 Games , 2012, AAAI.

[9]  Tom Schaul,et al.  Unifying Count-Based Exploration and Intrinsic Motivation , 2016, NIPS.

[10]  Tom Schaul,et al.  Dueling Network Architectures for Deep Reinforcement Learning , 2015, ICML.

[11]  Nando de Freitas,et al.  Playing hard exploration games by watching YouTube , 2018, NeurIPS.

[12]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[13]  Nuttapong Chentanez,et al.  Intrinsically Motivated Reinforcement Learning , 2004, NIPS.

[14]  Marlos C. Machado,et al.  Benchmarking Bonus-Based Exploration Methods on the Arcade Learning Environment , 2019, ArXiv.

[15]  Kenneth O. Stanley,et al.  Go-Explore: a New Approach for Hard-Exploration Problems , 2019, ArXiv.

[16]  David Budden,et al.  Distributed Prioritized Experience Replay , 2018, ICLR.

[17]  Wes McKinney,et al.  Data Structures for Statistical Computing in Python , 2010, SciPy.

[18]  Martin A. Riedmiller,et al.  Leveraging Demonstrations for Deep Reinforcement Learning on Robotics Problems with Sparse Rewards , 2017, ArXiv.

[19]  Tom Schaul,et al.  Reinforcement Learning with Unsupervised Auxiliary Tasks , 2016, ICLR.

[20]  Marcin Andrychowicz,et al.  Overcoming Exploration in Reinforcement Learning with Demonstrations , 2017, 2018 IEEE International Conference on Robotics and Automation (ICRA).

[21]  Sergio Gomez Colmenarejo,et al.  One-Shot High-Fidelity Imitation: Training Large-Scale Deep Nets with RL , 2018, ArXiv.

[22]  Yoshua Bengio,et al.  Reinforced Imitation in Heterogeneous Action Space , 2019, ArXiv.

[23]  Julian Togelius,et al.  Obstacle Tower: A Generalization Challenge in Vision, Control, and Planning , 2019, IJCAI.

[24]  Tom Schaul,et al.  Prioritized Experience Replay , 2015, ICLR.

[25]  Travis E. Oliphant,et al.  Guide to NumPy , 2015 .

[26]  Yuan Yu,et al.  TensorFlow: A system for large-scale machine learning , 2016, OSDI.

[27]  Alex Graves,et al.  Automated Curriculum Learning for Neural Networks , 2017, ICML.

[28]  Albin Cassirer,et al.  Randomized Prior Functions for Deep Reinforcement Learning , 2018, NeurIPS.

[29]  Rémi Munos,et al.  Recurrent Experience Replay in Distributed Reinforcement Learning , 2018, ICLR.

[30]  Pieter Abbeel,et al.  Benchmarking Model-Based Reinforcement Learning , 2019, ArXiv.

[31]  Shane Legg,et al.  Human-level control through deep reinforcement learning , 2015, Nature.

[32]  Marlos C. Machado,et al.  Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents , 2017, J. Artif. Intell. Res..

[33]  Tom Schaul,et al.  Rainbow: Combining Improvements in Deep Reinforcement Learning , 2017, AAAI.

[34]  Tim Salimans,et al.  Learning Montezuma's Revenge from a Single Demonstration , 2018, ArXiv.

[35]  Joelle Pineau,et al.  Learning from Limited Demonstrations , 2013, NIPS.

[36]  Sergey Levine,et al.  DeepMimic , 2018, ACM Trans. Graph..

[37]  Doina Precup,et al.  Off-Policy Deep Reinforcement Learning without Exploration , 2018, ICML.

[38]  Tom Schaul,et al.  Deep Q-learning From Demonstrations , 2017, AAAI.

[39]  Yuval Tassa,et al.  Learning human behaviors from motion capture by adversarial imitation , 2017, ArXiv.

[40]  Jürgen Schmidhuber,et al.  Curious model-building control systems , 1991, [Proceedings] 1991 IEEE International Joint Conference on Neural Networks.

[41]  John D. Hunter,et al.  Matplotlib: A 2D Graphics Environment , 2007, Computing in Science & Engineering.

[42]  Jason Weston,et al.  Curriculum learning , 2009, ICML '09.