Stochastic Latent Actor-Critic: Deep Reinforcement Learning with a Latent Variable Model

Deep reinforcement learning (RL) algorithms can use high-capacity deep networks to learn directly from image observations. However, these high-dimensional observation spaces present a number of challenges in practice, since the policy must now solve two problems: representation learning and task learning. In this work, we tackle these two problems separately, by explicitly learning latent representations that can accelerate reinforcement learning from images. We propose the stochastic latent actor-critic (SLAC) algorithm: a sample-efficient and high-performing RL algorithm for learning policies for complex continuous control tasks directly from high-dimensional image inputs. SLAC provides a novel and principled approach for unifying stochastic sequential models and RL into a single method, by learning a compact latent representation and then performing RL in the model's learned latent space. Our experimental evaluation demonstrates that our method outperforms both model-free and model-based alternatives in terms of final performance and sample efficiency, on a range of difficult image-based control tasks. Our code and videos of our results are available at our website.
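To make the two-part structure concrete, the sketch below shows the interfaces such a method implies: a stochastic latent variable model trained with an evidence lower bound (ELBO), and a critic that consumes latent samples rather than raw images. This is a minimal illustration, assuming single-step latents and fully-connected networks; all module names, shapes, and hyperparameters are our own placeholders, not the authors' released code.

```python
# Minimal sketch (not the authors' implementation): a sequential latent variable
# model trained with an ELBO, and a critic defined on latent samples.
import torch
import torch.nn as nn
from torch.distributions import Normal, kl_divergence

def mlp(in_dim, out_dim, hidden=256):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim))

class LatentModel(nn.Module):
    """q(z_t | x_t), p(z_{t+1} | z_t, a_t), and p(x_t | z_t), all Gaussian (illustrative)."""
    def __init__(self, obs_dim, act_dim, latent_dim=32):
        super().__init__()
        self.encoder = mlp(obs_dim, 2 * latent_dim)                 # posterior over latents
        self.dynamics = mlp(latent_dim + act_dim, 2 * latent_dim)   # learned latent prior
        self.decoder = mlp(latent_dim, obs_dim)                     # observation likelihood mean

    def posterior(self, obs):
        mean, log_std = self.encoder(obs).chunk(2, dim=-1)
        return Normal(mean, log_std.exp())

    def prior(self, z, act):
        mean, log_std = self.dynamics(torch.cat([z, act], dim=-1)).chunk(2, dim=-1)
        return Normal(mean, log_std.exp())

def elbo_loss(model, obs, act, next_obs):
    """One-step ELBO: reconstruction of the next observation minus KL(posterior || prior)."""
    z = model.posterior(obs).rsample()
    post_next = model.posterior(next_obs)
    prior_next = model.prior(z, act)
    z_next = post_next.rsample()
    recon = Normal(model.decoder(z_next), 1.0).log_prob(next_obs).sum(-1)
    kl = kl_divergence(post_next, prior_next).sum(-1)
    return (kl - recon).mean()

class LatentCritic(nn.Module):
    """Q-function that takes latent samples instead of raw image observations."""
    def __init__(self, latent_dim, act_dim):
        super().__init__()
        self.q = mlp(latent_dim + act_dim, 1)

    def forward(self, z, act):
        return self.q(torch.cat([z, act], dim=-1)).squeeze(-1)
```

In the full algorithm, the critic would be trained with a soft Bellman backup on latents sampled from the model's posterior while the model itself is trained with the ELBO; the sketch above only fixes the interfaces, not the training loop.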
