MOPO: Model-based Offline Policy Optimization

Offline reinforcement learning (RL) refers to the problem of learning policies entirely from a batch of previously collected data. This problem setting is compelling because it offers the promise of utilizing large, diverse, previously collected datasets to acquire policies without any costly or dangerous active exploration, but it is also exceptionally difficult due to the distributional shift between the offline training data and the learned policy. While there has been significant progress in model-free offline RL, the most successful prior methods constrain the policy to the support of the data, precluding generalization to new states. In this paper, we observe that an existing model-based RL algorithm on its own already produces significant gains in the offline setting, as compared to model-free approaches, despite not being designed for this setting. However, although many standard model-based RL methods already estimate the uncertainty of their model, they do not by themselves provide a mechanism to avoid the issues associated with distributional shift in the offline setting. We therefore propose to modify existing model-based RL methods to address these issues by casting offline model-based RL into a penalized MDP framework. We theoretically show that, by using this penalized MDP, we are maximizing a lower bound of the policy's return in the true MDP. Based on our theoretical results, we propose a new model-based offline RL algorithm that applies the variance of a Lipschitz-regularized model as a penalty to the reward function. We find that this algorithm outperforms both standard model-based RL methods and existing state-of-the-art model-free offline RL approaches on existing offline RL benchmarks, as well as two challenging continuous control tasks that require generalizing from data collected for a different task.
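The core construction described above is a reward penalized by an estimate of the model's predictive uncertainty: the policy is trained in the learned MDP on r_hat(s, a) - lambda * u(s, a), where u(s, a) reflects the model's variance at (s, a). Below is a minimal sketch of that penalty, assuming an ensemble of probabilistic dynamics models; the function name `penalized_reward`, the choice of the maximum predictive-std norm across ensemble members as the uncertainty estimate, and the coefficient `lam` are illustrative stand-ins, not the authors' exact implementation.

```python
# A minimal sketch of the uncertainty-penalized reward, assuming an ensemble
# of learned probabilistic dynamics models. The max predictive-std norm is
# one possible uncertainty heuristic, used here for illustration only.
import numpy as np

def penalized_reward(reward, pred_stds, lam=1.0):
    """Penalize the model-predicted reward by a model-uncertainty estimate.

    reward:    scalar reward predicted by the learned model, r_hat(s, a)
    pred_stds: array of shape (ensemble_size, state_dim) holding the
               predictive standard deviations of each ensemble member at (s, a)
    lam:       penalty coefficient lambda, trading off return against risk
    """
    # Larger disagreement / predictive variance => larger penalty, so the
    # policy is discouraged from exploiting regions the model cannot support.
    u = max(np.linalg.norm(std) for std in pred_stds)
    return reward - lam * u

# Example: a 2-member "ensemble" whose members disagree at (s, a).
stds = np.array([[0.1, 0.2],
                 [0.5, 0.4]])
print(penalized_reward(reward=1.0, pred_stds=stds, lam=1.0))  # 1.0 - ||[0.5, 0.4]||
```

In the full algorithm, transitions generated by short rollouts of the learned model are relabeled with this penalized reward and mixed with the offline data for an off-policy learner; the penalty coefficient controls how strongly the policy is kept within regions where the model is reliable.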
