Quasi-Newton Trust Region Policy Optimization

We propose a trust region method for policy optimization that employs a Quasi-Newton approximation of the Hessian, called Quasi-Newton Trust Region Policy Optimization (QNTRPO). Gradient descent is the de facto algorithm for reinforcement learning tasks with continuous control, and it has achieved state-of-the-art performance across a wide range of such tasks. However, the algorithm suffers from several drawbacks, including the lack of a stepsize selection criterion and slow convergence. We investigate a trust region method that uses a dogleg step together with a Quasi-Newton approximation of the Hessian for policy optimization. We demonstrate through numerical experiments on a wide range of challenging continuous control tasks that this choice is sample efficient and improves performance. A sketch of the dogleg computation appears below.
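
To make the trust region step concrete, below is a minimal sketch of the classical dogleg step (in the sense of Nocedal and Wright's Numerical Optimization), assuming a positive-definite quasi-Newton Hessian approximation B of the surrogate objective; the function name and the use of dense NumPy linear algebra are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def dogleg_step(g, B, delta):
    """Approximately solve  min_p  g^T p + 0.5 p^T B p  s.t. ||p|| <= delta.

    g     -- gradient of the quadratic model (1-D array)
    B     -- positive-definite quasi-Newton Hessian approximation (2-D array)
    delta -- trust region radius
    """
    # Full quasi-Newton step: unconstrained minimizer of the quadratic model.
    p_newton = -np.linalg.solve(B, g)
    if np.linalg.norm(p_newton) <= delta:
        return p_newton

    # Cauchy point: minimizer of the model along the steepest-descent direction.
    p_cauchy = -(g @ g) / (g @ B @ g) * g
    if np.linalg.norm(p_cauchy) >= delta:
        # Even the Cauchy point lies outside the region: truncate steepest descent.
        return -(delta / np.linalg.norm(g)) * g

    # Otherwise move from the Cauchy point toward the Newton point and stop
    # where the dogleg path crosses the boundary: ||p_cauchy + tau * d|| = delta.
    d = p_newton - p_cauchy
    a = d @ d
    b = 2.0 * (p_cauchy @ d)
    c = p_cauchy @ p_cauchy - delta ** 2
    tau = (-b + np.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)
    return p_cauchy + tau * d
```

In a policy optimization setting, g would be the gradient of the surrogate objective with respect to the policy parameters and delta would be adjusted between iterations according to how well the quadratic model predicts the actual improvement.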
