Never Give Up: Learning Directed Exploration Strategies

We propose a reinforcement learning agent that solves hard exploration games by learning a range of directed exploratory policies. We construct an episodic memory-based intrinsic reward, using k-nearest neighbors over the agent's recent experience, to train these directed exploratory policies, thereby encouraging the agent to repeatedly revisit all states in its environment. A self-supervised inverse dynamics model is used to train the embeddings of the nearest-neighbor lookup, biasing the novelty signal towards what the agent can control. We employ the framework of Universal Value Function Approximators (UVFA) to simultaneously learn many directed exploration policies with the same neural network, each with a different trade-off between exploration and exploitation. By using the same neural network for different degrees of exploration/exploitation, we demonstrate that transfer from predominantly exploratory policies yields effective exploitative policies. The proposed method can be combined with modern distributed RL agents that collect large amounts of experience from many actors running in parallel on separate environment instances. Our method doubles the performance of the base agent in all hard-exploration games in the Atari-57 suite while maintaining a very high score across the remaining games, obtaining a median human-normalised score of 1344.0%. Notably, the proposed method is the first algorithm to achieve non-zero rewards (with a mean score of 8,400) in the game of Pitfall! without using demonstrations or hand-crafted features.
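The episodic novelty bonus described above can be pictured as a k-nearest-neighbor count in a learned embedding space: states with few similar neighbors in the current episode's memory receive a large intrinsic reward. The sketch below is a minimal illustration of that idea, assuming an embedding produced by the inverse dynamics model is already available; the function name, constants, and the per-call distance normalisation are illustrative simplifications, not the paper's exact formulation.

    import numpy as np

    def episodic_intrinsic_reward(embedding, memory, k=10, eps=1e-3, c=1e-3):
        """Sketch of a k-nearest-neighbor episodic novelty bonus.

        embedding: (d,) controllable-state embedding of the current observation.
        memory:    (n, d) array of embeddings stored so far in this episode.
        Names and constants here are illustrative assumptions.
        """
        if len(memory) == 0:
            return 1.0  # an empty episodic memory makes every state novel
        # squared Euclidean distances to the k nearest stored embeddings
        d2 = np.sum((memory - embedding) ** 2, axis=1)
        d2 = np.sort(d2)[:k]
        # normalise by the mean of these k distances (a simplification of the
        # running-average normalisation used in practice)
        d2 = d2 / max(d2.mean(), 1e-8)
        # inverse kernel: near-duplicate neighbors contribute ~1, distant ones ~0
        similarity = eps / (d2 + eps)
        # the bonus shrinks as more similar neighbors accumulate in memory
        return 1.0 / np.sqrt(similarity.sum() + c)

In use, such a bonus would be computed for every transition, the new embedding appended to the episodic memory, and the memory cleared at episode boundaries, so the agent is rewarded for revisiting states within each episode rather than only the first time it ever sees them.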
