When to Trust Your Model: Model-Based Policy Optimization

Designing effective model-based reinforcement learning algorithms is difficult because the ease of data generation must be weighed against the bias of model-generated data. In this paper, we study the role of model usage in policy optimization both theoretically and empirically. We first formulate and analyze a model-based reinforcement learning algorithm with a guarantee of monotonic improvement at each step. In practice, however, this analysis is overly pessimistic and suggests that real off-policy data is always preferable to model-generated on-policy data, but we show that an empirical estimate of model generalization can be incorporated into such an analysis to justify model usage. Motivated by this analysis, we then demonstrate that a simple procedure of using short model-generated rollouts branched from real data has the benefits of more complicated model-based algorithms without the usual pitfalls. In particular, this approach surpasses the sample efficiency of prior model-based methods, matches the asymptotic performance of the best model-free algorithms, and scales to horizons that cause other model-based methods to fail entirely.
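
The branched-rollout procedure described in the abstract can be sketched in a few lines of code. The sketch below is an illustrative reconstruction only, not the paper's implementation: the ReplayBuffer, toy_policy, and toy_model stand-ins are hypothetical placeholders, and in the actual algorithm the model would be a learned ensemble of dynamics models and the policy an off-policy model-free learner trained on the generated data.

```python
# Minimal sketch of short model-generated rollouts branched from real data.
# All components here are simplified stand-ins (plain numpy), not the
# paper's implementation.
import numpy as np

class ReplayBuffer:
    """Fixed-size buffer of (state, action, reward, next_state) transitions."""
    def __init__(self, capacity=100_000):
        self.capacity = capacity
        self.data = []

    def add(self, transition):
        if len(self.data) >= self.capacity:
            self.data.pop(0)
        self.data.append(transition)

    def sample_states(self, n):
        idx = np.random.randint(len(self.data), size=n)
        return np.stack([self.data[i][0] for i in idx])


def branched_rollouts(env_buffer, model_buffer, model, policy, k=1, n_starts=64):
    """Branch k-step model rollouts from states drawn from real data.

    Keeping k small limits compounding model error while still producing
    on-policy data for the model-free learner.
    """
    states = env_buffer.sample_states(n_starts)
    for _ in range(k):
        actions = policy(states)
        next_states, rewards = model(states, actions)
        for s, a, r, s2 in zip(states, actions, rewards, next_states):
            model_buffer.add((s, a, r, s2))
        states = next_states


# --- toy stand-ins so the sketch runs end to end (hypothetical) -----------
state_dim, action_dim = 4, 2
toy_policy = lambda s: np.tanh(s @ np.random.randn(state_dim, action_dim))
toy_model = lambda s, a: (s + 0.1 * (a @ np.random.randn(action_dim, state_dim)),
                          -np.sum(s ** 2, axis=1))

env_buffer, model_buffer = ReplayBuffer(), ReplayBuffer()
for _ in range(256):  # pretend these states came from real environment interaction
    env_buffer.add((np.random.randn(state_dim), None, None, None))

branched_rollouts(env_buffer, model_buffer, toy_model, toy_policy, k=3)
print(f"generated {len(model_buffer.data)} model transitions")
# A model-free learner would then train on batches drawn from model_buffer,
# optionally mixed with real transitions from env_buffer.
```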
