Bidirectional Model-based Policy Optimization

Model-based reinforcement learning approaches leverage a forward dynamics model to support planning and decision making, but they may fail catastrophically when the model is inaccurate. Although several existing methods are dedicated to combating model error, the potential of a single forward model remains limited. In this paper, we propose to additionally construct a backward dynamics model to reduce the reliance on the accuracy of forward model predictions. We develop a novel method, called Bidirectional Model-based Policy Optimization (BMPO), that uses both the forward and backward models to generate short branched rollouts for policy optimization. Furthermore, we theoretically derive a tighter bound on the return discrepancy, which shows the superiority of BMPO over methods that use only a forward model. Extensive experiments demonstrate that BMPO outperforms state-of-the-art model-based methods in terms of both sample efficiency and asymptotic performance.
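
To make the rollout scheme concrete, the sketch below gives one plausible reading of a bidirectional branched rollout in Python: starting from a real state, the forward model predicts successors under the current policy while the backward model predicts predecessors under a backward policy, and both branches are kept short to limit compounding model error. All interfaces here (`forward_model`, `backward_model`, `policy`, `backward_policy`) are hypothetical stand-ins for exposition, not the paper's actual implementation.

```python
def bidirectional_rollout(state, forward_model, backward_model,
                          policy, backward_policy, horizon):
    """Generate a short branched rollout in both directions from a real state.

    Assumed (illustrative) interfaces:
      forward_model(s, a)  -> (s_next, reward)   # predicts the successor of s
      backward_model(s, a) -> (s_prev, reward)   # predicts a predecessor of s
      policy(s)            -> action taken in s
      backward_policy(s)   -> action hypothesized to have led into s
    """
    transitions = []

    # Forward branch: imagine successors of the real state.
    s = state
    for _ in range(horizon):
        a = policy(s)
        s_next, r = forward_model(s, a)
        transitions.append((s, a, r, s_next))
        s = s_next

    # Backward branch: imagine predecessors of the same real state,
    # so the branched rollout extends in both temporal directions.
    s = state
    for _ in range(horizon):
        a = backward_policy(s)
        s_prev, r = backward_model(s, a)
        transitions.append((s_prev, a, r, s))
        s = s_prev

    return transitions
```

In a Dyna-style loop, the model-generated transitions would populate a model buffer on which an off-policy learner (e.g., SAC) is trained; using both directions doubles the imagined data per real state without lengthening either branch.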
