MAKING EFFICIENT USE OF DEMONSTRATIONS TO SOLVE HARD EXPLORATION PROBLEMS

This paper introduces R2D3, an agent that makes efficient use of demonstrations to solve hard exploration problems in partially observable environments with highly variable initial conditions. We also introduce a suite of eight tasks that combine these three properties, and show that R2D3 can solve several of them on which other state-of-the-art methods (both with and without demonstrations) fail to see even a single successful trajectory after tens of billions of exploration steps.
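The "efficient use of demonstrations" in R2D3 centers on a demo ratio: each element of a training batch is drawn from a replay buffer of expert demonstrations with a small fixed probability, and from the agent's own experience otherwise. The sketch below illustrates only that sampling split; the names, the placeholder transitions, and the use of uniform rather than prioritized sampling within each buffer are simplifying assumptions, not the paper's actual implementation.

import random

# Hedged sketch of the demo-ratio batch sampling that R2D3 describes.
# Two replay buffers: one of expert demonstrations, one of agent experience.
# Each slot in a training batch comes from the demo buffer with a small
# fixed probability (the "demo ratio"); uniform sampling within each buffer
# is a simplification of the paper's prioritized replay.

def sample_batch(demo_buffer, agent_buffer, batch_size, demo_ratio=1 / 256):
    """Draw a mixed batch; each element comes from demos w.p. demo_ratio."""
    batch = []
    for _ in range(batch_size):
        buffer = demo_buffer if random.random() < demo_ratio else agent_buffer
        batch.append(random.choice(buffer))
    return batch

# Toy usage with placeholder (observation, action, reward) transitions.
demos = [("demo_obs", "demo_action", 1.0)] * 100
agent = [("agent_obs", "agent_action", 0.0)] * 10_000
mixed = sample_batch(demos, agent, batch_size=64)
print(sum(1 for t in mixed if t[2] == 1.0), "demo transitions in batch")

A small demo ratio keeps the demonstrations from dominating training while still seeding the replay stream with successful behavior, which is what lets the agent bootstrap on tasks where random exploration never reaches a reward.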
