Mismatched No More: Joint Model-Policy Optimization for Model-Based RL

Many model-based reinforcement learning (RL) methods follow a similar template: fit a model to previously observed data, and then use data from that model for RL or planning. However, models that achieve better training performance (e.g., lower MSE) are not necessarily better for control: an RL agent may seek out the small fraction of states where an accurate model makes mistakes, or it might act in ways that do not expose the errors of an inaccurate model. As noted in prior work, there is an objective mismatch: models are useful insofar as they yield good policies, but they are trained to maximize their accuracy rather than the performance of the policies that result from them. In this work, we propose a single objective for jointly training the model and the policy, such that updates to either component increase a lower bound on expected return. This joint optimization resolves the objective mismatch of prior methods. Our objective is a global lower bound on expected return, and the bound becomes tight under certain assumptions. The resulting algorithm (MnM) is conceptually similar to a GAN: a classifier distinguishes between real and model-generated transitions, the model is updated to produce transitions that look realistic, and the policy is updated to avoid states where the model's predictions are unrealistic.
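To make the GAN-like structure concrete, here is a minimal sketch (not the authors' released code) of the three updates described above, assuming simple MLP components in PyTorch. The names `MLP`, `classifier_step`, `model_step`, and `augmented_reward`, and the exact log-ratio form of the policy's penalty, are illustrative assumptions rather than the paper's precise objective.

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    """Small feedforward network used for both the dynamics model and the classifier."""
    def __init__(self, in_dim, out_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        return self.net(x)

obs_dim, act_dim = 4, 2
model = MLP(obs_dim + act_dim, obs_dim)           # predicts next state s' from (s, a)
classifier = MLP(2 * obs_dim + act_dim, 1)        # logit: real vs. model-generated transition

model_opt = torch.optim.Adam(model.parameters(), lr=3e-4)
clf_opt = torch.optim.Adam(classifier.parameters(), lr=3e-4)
bce = nn.BCEWithLogitsLoss()

def classifier_step(s, a, s_next):
    """Train the classifier to label real transitions as 1 and model samples as 0."""
    fake_next = model(torch.cat([s, a], dim=-1)).detach()
    real_logits = classifier(torch.cat([s, a, s_next], dim=-1))
    fake_logits = classifier(torch.cat([s, a, fake_next], dim=-1))
    loss = bce(real_logits, torch.ones_like(real_logits)) + \
           bce(fake_logits, torch.zeros_like(fake_logits))
    clf_opt.zero_grad(); loss.backward(); clf_opt.step()

def model_step(s, a):
    """Update the model so its predicted transitions look realistic to the classifier."""
    fake_next = model(torch.cat([s, a], dim=-1))
    fake_logits = classifier(torch.cat([s, a, fake_next], dim=-1))
    loss = bce(fake_logits, torch.ones_like(fake_logits))
    model_opt.zero_grad(); loss.backward(); model_opt.step()

def augmented_reward(env_reward, s, a, s_next):
    """Penalize the policy for transitions the classifier deems unrealistic,
    via a log-likelihood-ratio term (an assumed form of the penalty)."""
    with torch.no_grad():
        logit = classifier(torch.cat([s, a, s_next], dim=-1))
        log_ratio = nn.functional.logsigmoid(logit) - nn.functional.logsigmoid(-logit)
    return env_reward + log_ratio.squeeze(-1)
```

In such a loop, the classifier and model steps would alternate on minibatches of real and imagined transitions, while the policy is trained (with any off-the-shelf RL algorithm) on model rollouts whose rewards are shaped by `augmented_reward`, steering it away from regions where the model is untrustworthy.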
