Bootstrap Latent-Predictive Representations for Multitask Reinforcement Learning

Learning a good representation is an essential component of deep reinforcement learning (RL). Representation learning is especially important in multitask and partially observable settings, where building a representation of the unknown environment is crucial for solving the tasks. Here we introduce Prediction of Bootstrap Latents (PBL), a simple and flexible self-supervised representation learning algorithm for multitask deep RL. PBL builds on multistep predictive representations of future observations and focuses on capturing structured information about environment dynamics. Specifically, PBL trains its representation by predicting latent embeddings of future observations. These latent embeddings are themselves trained to be predictive of the aforementioned representations. Together, these predictions create a bootstrapping effect that lets the agent learn more about the key aspects of the environment dynamics. In addition, by defining its prediction tasks entirely in latent space, PBL offers the flexibility of using multimodal observations involving pixel images, language instructions, rewards, and more. Our experiments show that PBL delivers across-the-board performance improvements over state-of-the-art deep RL agents in the DMLab-30 and Atari-57 multitask settings.

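Below is a minimal sketch of the two coupled prediction losses the abstract describes, written in PyTorch. It is an illustration under simplifying assumptions, not the paper's implementation: the module names (obs_embed, forward_pred, reverse_pred) are hypothetical, the networks are plain MLPs with MSE losses, and the recurrent agent state, action conditioning, and multistep prediction horizons used by PBL are omitted.

```python
# Sketch of PBL-style bootstrapped latent prediction (simplified; names are illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, state_dim, latent_dim = 64, 128, 32

# f: embeds a future observation o_{t+k} into a latent target.
obs_embed = nn.Sequential(nn.Linear(obs_dim, latent_dim), nn.ReLU(),
                          nn.Linear(latent_dim, latent_dim))
# g: predicts that latent from the agent's representation b_t.
forward_pred = nn.Sequential(nn.Linear(state_dim, latent_dim), nn.ReLU(),
                             nn.Linear(latent_dim, latent_dim))
# g': predicts the agent representation back from the observation latent.
reverse_pred = nn.Sequential(nn.Linear(latent_dim, state_dim), nn.ReLU(),
                             nn.Linear(state_dim, state_dim))

def pbl_losses(agent_state, future_obs):
    """agent_state: (B, state_dim) representation b_t of the history.
    future_obs: (B, obs_dim) observation reached some steps later."""
    latent = obs_embed(future_obs)
    # Forward loss: predict the future latent from the agent representation.
    # The latent target is detached, so it is shaped only by the reverse loss,
    # which is where the bootstrapping effect comes from.
    forward_loss = F.mse_loss(forward_pred(agent_state), latent.detach())
    # Reverse loss: train the observation embedding to predict the (detached)
    # agent representation, keeping the latent targets informative.
    reverse_loss = F.mse_loss(reverse_pred(latent), agent_state.detach())
    return forward_loss + reverse_loss

# Usage with dummy tensors in place of a real agent and environment.
loss = pbl_losses(torch.randn(8, state_dim), torch.randn(8, obs_dim))
loss.backward()
```

Because both prediction tasks operate purely on latent vectors, the same structure applies regardless of whether the observations are pixels, language instructions, rewards, or a combination of these.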