Reinforcement Learning with Unsupervised Auxiliary Tasks

Deep reinforcement learning agents have achieved state-of-the-art results by directly maximising cumulative reward. However, environments contain a much wider variety of possible training signals. In this paper, we introduce an agent that also maximises many other pseudo-reward functions simultaneously by reinforcement learning. All of these tasks share a common representation that, like unsupervised learning, continues to develop in the absence of extrinsic rewards. We also introduce a novel mechanism for focusing this representation upon extrinsic rewards, so that learning can rapidly adapt to the most relevant aspects of the actual task. Our agent significantly outperforms the previous state of the art on Atari, averaging 880% expert human performance, and on a challenging suite of first-person, three-dimensional Labyrinth tasks, where it achieves a mean 10x speedup in learning and averages 87% expert human performance.
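To make the core idea concrete, here is a minimal PyTorch-style sketch (an illustration under assumptions, not the paper's architecture or hyperparameters): a shared encoder feeds both the main actor-critic heads and an auxiliary pseudo-reward head, so gradients from the auxiliary loss keep shaping the representation even when extrinsic reward is sparse. All names, dimensions, and the loss weighting below are hypothetical.

# Sketch of a shared representation trained jointly on a main actor-critic
# objective and an auxiliary pseudo-reward task. Assumed, illustrative code;
# requires PyTorch.
import torch
import torch.nn as nn

class SharedRepAgent(nn.Module):
    def __init__(self, obs_channels=3, n_actions=6, n_aux=16):
        super().__init__()
        # Shared encoder: continues to receive gradients from the auxiliary
        # loss even in the absence of extrinsic reward.
        self.encoder = nn.Sequential(
            nn.Conv2d(obs_channels, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(256), nn.ReLU(),
        )
        self.policy = nn.Linear(256, n_actions)  # main task: actor head
        self.value = nn.Linear(256, 1)           # main task: critic head
        self.aux_q = nn.Linear(256, n_aux)       # auxiliary pseudo-reward head

    def forward(self, obs):
        h = self.encoder(obs)
        return self.policy(h), self.value(h), self.aux_q(h)

agent = SharedRepAgent()
obs = torch.randn(8, 3, 84, 84)                  # batch of image observations
logits, value, aux = agent(obs)

# Hypothetical combined loss: placeholder targets stand in for returns and
# pseudo-returns; the 0.1 weight (an assumption) focuses the shared
# representation on the extrinsic task while the auxiliary term still trains it.
value_target = torch.randn_like(value)
aux_target = torch.randn_like(aux)
loss = (value - value_target).pow(2).mean() \
     + 0.1 * (aux - aux_target).pow(2).mean()
loss.backward()

In practice the auxiliary targets would come from pseudo-reward functions defined on the agent's experience (the paper's focusing mechanism then weights how strongly each task drives the shared representation); the point of the sketch is only the weight sharing between the extrinsic and auxiliary heads.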
