Simplifying Model-based RL: Learning Representations, Latent-space Models, and Policies with One Objective

While reinforcement learning (RL) methods that learn an internal model of the environment have the potential to be more sample efficient than their model-free counterparts, learning to model raw observations from high-dimensional sensors can be challenging. Prior work has addressed this challenge by learning low-dimensional representations of observations through auxiliary objectives, such as reconstruction or value prediction. However, the alignment between these auxiliary objectives and the RL objective is often unclear. In this work, we propose a single objective that jointly optimizes a latent-space model and a policy to achieve high returns while remaining self-consistent. This objective is a lower bound on expected returns. Unlike prior bounds for model-based RL, which concern policy exploration or model guarantees, our bound applies directly to the overall RL objective. We demonstrate that the resulting algorithm matches or improves the sample efficiency of the best prior model-based and model-free RL methods. While such sample-efficient methods are typically computationally demanding, our method attains the performance of SAC (soft actor-critic) in about 50% less wall-clock time.
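
To make the idea of a single objective concrete, the sketch below jointly trains an encoder, a latent dynamics model, a reward head, and a policy by maximizing predicted returns from an imagined latent rollout minus a self-consistency KL term that keeps the model's predicted latents close to the encoder's. This is a minimal illustration of the general recipe, not the paper's exact objective; all module architectures, the unit-variance Gaussians, and the hyperparameters (OBS_DIM, HORIZON, and so on) are assumptions made only for the example.

    # Illustrative sketch of one joint objective for representation, model, and policy.
    # Not the paper's exact bound: module names, architectures, and constants are assumed.
    import torch
    import torch.nn as nn
    from torch.distributions import Normal, kl_divergence

    OBS_DIM, ACT_DIM, LATENT_DIM, HORIZON, GAMMA = 24, 6, 16, 5, 0.99

    def mlp_head(in_dim, out_dim):
        # Small MLP used as the mean of a unit-variance Gaussian (an assumption for simplicity).
        return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))

    encoder = mlp_head(OBS_DIM, LATENT_DIM)                 # q(z | s)
    model   = mlp_head(LATENT_DIM + ACT_DIM, LATENT_DIM)    # m(z' | z, a)
    reward  = mlp_head(LATENT_DIM + ACT_DIM, 1)             # predicted reward r(z, a)
    policy  = nn.Sequential(mlp_head(LATENT_DIM, ACT_DIM), nn.Tanh())  # pi(a | z)

    def joint_lower_bound(obs_seq):
        """obs_seq: [HORIZON + 1, batch, OBS_DIM] observations from the replay buffer."""
        z = Normal(encoder(obs_seq[0]), 1.0).rsample()       # start from an encoded latent
        objective = 0.0
        for t in range(HORIZON):
            a = policy(z)                                    # action from the current latent
            objective = objective + (GAMMA ** t) * reward(torch.cat([z, a], -1)).mean()
            prior = Normal(model(torch.cat([z, a], -1)), 1.0)    # model's predicted next latent
            posterior = Normal(encoder(obs_seq[t + 1]), 1.0)     # encoder's next latent
            # Self-consistency term: penalize disagreement between model and encoder.
            objective = objective - (GAMMA ** t) * kl_divergence(posterior, prior).sum(-1).mean()
            z = prior.rsample()                              # continue the rollout in latent space
        return objective

    # Usage: a single optimizer over all four modules minimizes -joint_lower_bound(batch),
    # so the representation, the latent-space model, and the policy share one loss.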
