Model-Based Reinforcement Learning via Meta-Policy Optimization

Model-based reinforcement learning approaches carry the promise of being data-efficient. However, because learned dynamics models rarely match the real-world dynamics closely enough, these approaches struggle to achieve the same asymptotic performance as model-free methods. We propose Model-Based Meta-Policy-Optimization (MB-MPO), an approach that forgoes the strong reliance on accurate learned dynamics models. Using an ensemble of learned dynamics models, MB-MPO meta-learns a policy that can quickly adapt to any model in the ensemble with one policy gradient step. This steers the meta-policy towards internalizing the dynamics predictions that are consistent across the ensemble, while shifting the burden of acting optimally under model discrepancies to the adaptation step. Our experiments show that MB-MPO is more robust to model imperfections than previous model-based approaches. Finally, we demonstrate that our approach matches the asymptotic performance of model-free methods while requiring significantly less experience.
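
To make the training loop concrete, the sketch below illustrates the idea on a toy one-dimensional task: collect real transitions with the current policy, fit a bootstrapped ensemble of dynamics models, take one inner policy-gradient step per model, and update the meta-policy from the post-adaptation gradients. This is a minimal sketch under stated assumptions, not the authors' implementation: all names (true_step, fit_linear_model, rollout, policy_gradient) are illustrative, the models are linear rather than neural networks, and the meta-update uses a first-order approximation with plain gradient ascent instead of the full meta-gradient and trust-region update described in the paper.

import numpy as np

rng = np.random.default_rng(0)
HORIZON, ENSEMBLE_SIZE, INNER_LR, META_LR = 20, 5, 0.05, 0.02

def true_step(s, a):
    # Unknown real dynamics; the agent only samples transitions from it.
    return 0.9 * s + 0.5 * a + 0.05 * rng.standard_normal()

def fit_linear_model(states, actions, next_states):
    # Least-squares fit of s' ~ A*s + B*a on a bootstrap resample of the data,
    # so each ensemble member sees slightly different data.
    idx = rng.integers(0, len(states), len(states))
    X = np.stack([states[idx], actions[idx]], axis=1)
    coef, *_ = np.linalg.lstsq(X, next_states[idx], rcond=None)
    return coef  # [A, B]

def rollout(theta, model):
    # Imagined trajectory under one learned model; reward -s^2 (drive state to 0).
    s, states, actions, rewards = 1.0, [], [], []
    for _ in range(HORIZON):
        a = theta[0] * s + theta[1] + 0.1 * rng.standard_normal()
        states.append(s); actions.append(a); rewards.append(-s ** 2)
        s = model[0] * s + model[1] * a
    return np.array(states), np.array(actions), np.array(rewards)

def policy_gradient(theta, model):
    # REINFORCE estimate of the return gradient under a single model,
    # for a Gaussian policy a ~ N(theta[0]*s + theta[1], 0.1^2).
    states, actions, rewards = rollout(theta, model)
    returns = np.cumsum(rewards[::-1])[::-1]          # reward-to-go
    score = ((actions - (theta[0] * states + theta[1])) / 0.01)[:, None] \
            * np.stack([states, np.ones_like(states)], axis=1)
    return (score * returns[:, None]).mean(axis=0)

theta = np.zeros(2)                                   # meta-policy parameters
for _ in range(50):
    # 1) Collect real-world transitions with the current (pre-adaptation) policy.
    s, S, A, S2 = 1.0, [], [], []
    for _ in range(100):
        a = theta[0] * s + theta[1] + 0.1 * rng.standard_normal()
        s2 = true_step(s, a)
        S.append(s); A.append(a); S2.append(s2)
        s = s2
    data = (np.array(S), np.array(A), np.array(S2))

    # 2) Fit an ensemble of dynamics models on bootstrapped real data.
    ensemble = [fit_linear_model(*data) for _ in range(ENSEMBLE_SIZE)]

    # 3) One inner adaptation step per model, then average the post-adaptation
    #    gradients into the meta-update (first-order approximation).
    meta_grad = np.zeros_like(theta)
    for model in ensemble:
        theta_adapted = theta + INNER_LR * policy_gradient(theta, model)
        meta_grad += policy_gradient(theta_adapted, model) / ENSEMBLE_SIZE
    theta += META_LR * meta_grad

The structural point the sketch tries to capture is that the real environment is queried only in step 1; all policy-gradient computation in step 3 runs against the learned models, which is where the claimed sample-efficiency gain over model-free methods comes from.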
