Dream to Control: Learning Behaviors by Latent Imagination

To select effective actions in complex environments, intelligent agents need to generalize from past experience. World models can represent knowledge about the environment to facilitate such generalization. While learning world models from high-dimensional sensory inputs is becoming feasible through deep learning, there are many potential ways for deriving behaviors from them. We present Dreamer, a reinforcement learning agent that solves long-horizon tasks purely by latent imagination. We efficiently learn behaviors by backpropagating analytic gradients of learned state values through trajectories imagined in the compact state space of a learned world model. On 20 challenging visual control tasks, Dreamer exceeds existing approaches in data-efficiency, computation time, and final performance.
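For intuition, the core actor update described above can be sketched in a few lines of PyTorch-style pseudocode. This is an illustrative sketch, not the paper's implementation: the module names (actor, dynamics, reward_model, value_model, actor_opt) are hypothetical stand-ins, the actor is assumed to produce reparameterized (differentiable) action samples, and a plain discounted return with a value bootstrap stands in for the λ-returns used in the paper.

```python
import torch

def imagine_and_update(actor, dynamics, reward_model, value_model,
                       actor_opt, start_state: torch.Tensor,
                       horizon: int = 15, gamma: float = 0.99):
    """One actor update: imagine a latent trajectory with the learned world
    model and backpropagate the imagined return to the policy parameters."""
    state = start_state
    rewards = []
    for _ in range(horizon):
        action = actor(state)                # reparameterized, differentiable sample
        state = dynamics(state, action)      # learned latent transition
        rewards.append(reward_model(state))  # predicted reward in latent space
    bootstrap = value_model(state)           # learned value estimate at the horizon
    # Discounted return with a value bootstrap; gradients flow analytically
    # through the rewards, the bootstrap value, and the transitions themselves.
    ret = bootstrap
    for r in reversed(rewards):
        ret = r + gamma * ret
    actor_loss = -ret.mean()                 # maximize the imagined return
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
    return actor_loss.detach()
```

Note that only the actor's parameters sit in the optimizer here; gradients pass through the world model and value network, which in the full agent are trained on their own separate objectives.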
