On Effective Scheduling of Model-based Reinforcement Learning

Model-based reinforcement learning has attracted wide attention due to its superior sample efficiency. Despite its impressive success so far, it remains unclear how to appropriately schedule important hyperparameters, such as the real data ratio used for policy optimization in Dyna-style model-based algorithms, to achieve adequate performance. In this paper, we first theoretically analyze the role of real data in policy training, which suggests that gradually increasing the ratio of real data yields better performance. Inspired by this analysis, we propose a framework named AutoMBPO to automatically schedule the real data ratio as well as other hyperparameters when training the model-based policy optimization (MBPO) algorithm, a representative running case of model-based methods. On several continuous control tasks, the MBPO instance trained with hyperparameters scheduled by AutoMBPO significantly surpasses the original one, and the real data ratio schedule found by AutoMBPO is consistent with our theoretical analysis.
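To make the scheduling idea concrete, the following is a minimal sketch (not the paper's implementation) of how a gradually increasing real data ratio could gate mini-batch sampling in a Dyna-style policy-optimization loop. All names here (real_ratio_schedule, sample_mixed_batch, real_buffer, model_buffer, the schedule endpoints, and the commented-out update_policy call) are illustrative assumptions, not AutoMBPO's actual interface.

```python
# Sketch: schedule the fraction of real transitions per policy update,
# reflecting the analysis that the real data ratio should grow over training.
import random


def real_ratio_schedule(epoch, total_epochs, start=0.05, end=0.5):
    """Linearly increase the real data ratio from `start` to `end` (assumed endpoints)."""
    frac = min(max(epoch / max(total_epochs, 1), 0.0), 1.0)
    return start + (end - start) * frac


def sample_mixed_batch(real_buffer, model_buffer, batch_size, real_ratio):
    """Draw a mini-batch mixing real and model-generated transitions."""
    n_real = int(round(batch_size * real_ratio))
    n_model = batch_size - n_real
    batch = random.sample(real_buffer, min(n_real, len(real_buffer)))
    batch += random.sample(model_buffer, min(n_model, len(model_buffer)))
    return batch


if __name__ == "__main__":
    # Toy buffers standing in for environment rollouts and model rollouts.
    real_buffer = [("real", i) for i in range(1_000)]
    model_buffer = [("model", i) for i in range(10_000)]
    total_epochs = 100
    for epoch in range(total_epochs):
        ratio = real_ratio_schedule(epoch, total_epochs)
        batch = sample_mixed_batch(real_buffer, model_buffer, 256, ratio)
        # update_policy(batch)  # placeholder for an actual SAC/PPO-style update
    print(f"final real data ratio: {ratio:.2f}")
```

In AutoMBPO itself the schedule is produced automatically rather than fixed in advance; a hand-written linear ramp is used above only to illustrate the "gradually increasing ratio" behavior suggested by the theoretical analysis.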
