Bootstrapping the Expressivity with Model-based Planning

We compare model-free reinforcement learning with model-based approaches through the lens of the expressive power of neural networks for policies, $Q$-functions, and dynamics. We show, theoretically and empirically, that even for a one-dimensional continuous state space, there are many MDPs whose optimal $Q$-functions and policies are much more complex than the dynamics. We hypothesize that many real-world MDPs share this property. For these MDPs, model-based planning is favorable because the policies it produces can approximate the optimal policy significantly better than a neural network parameterization can, whereas both model-free and model-based policy optimization rely on such a parameterization. Motivated by the theory, we apply a simple multi-step model-based bootstrapping planner (BOOTS) to bootstrap a weak $Q$-function into a stronger policy. Empirical results show that applying BOOTS at test time on top of model-based or model-free policy optimization algorithms improves performance on MuJoCo benchmark tasks.
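The planner the abstract refers to can be summarized as solving, at each test-time state, a short-horizon planning problem over a learned dynamics model whose terminal value is supplied by the learned $Q$-function, and then executing only the first action. Below is a minimal sketch of that idea in Python, assuming batched callables `dynamics`, `reward`, and `q_fn` are available; the random-shooting optimizer, the horizon, and the candidate count are illustrative choices rather than the paper's exact implementation.

```python
import numpy as np

def boots_action(state, dynamics, reward, q_fn, action_dim,
                 plan_horizon=4, n_candidates=500,
                 action_low=-1.0, action_high=1.0):
    """Sketch of a multi-step bootstrapping planner: plan over the learned
    model for a few steps, then bootstrap the tail with the Q-function."""
    # Sample candidate sequences of H+1 actions: the first H actions roll the
    # learned model forward; the last one is only used to query Q at the final
    # imagined state.
    seqs = np.random.uniform(action_low, action_high,
                             size=(n_candidates, plan_horizon + 1, action_dim))
    returns = np.zeros(n_candidates)
    states = np.repeat(state[None, :], n_candidates, axis=0)
    for t in range(plan_horizon):
        a_t = seqs[:, t, :]
        returns += reward(states, a_t)   # accumulate model-predicted rewards
        states = dynamics(states, a_t)   # advance the learned dynamics model
    # Bootstrap the remaining return with the (possibly weak) Q-function.
    returns += q_fn(states, seqs[:, -1, :])
    best = int(np.argmax(returns))
    return seqs[best, 0, :]              # execute only the first planned action
```

The uniform shooting step is only one possible optimizer; a cross-entropy-method or gradient-based search over the action sequence could be substituted without changing the overall structure of the planner.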
