Control Regularization for Reduced Variance Reinforcement Learning

Dealing with high variance is a significant challenge in model-free reinforcement learning (RL). Existing methods are unreliable, exhibiting high variance in performance from run to run using different initializations/seeds. Focusing on problems arising in continuous control, we propose a functional regularization approach to augmenting model-free RL. In particular, we regularize the behavior of the deep policy to be similar to a policy prior, i.e., we regularize in function space. We show that functional regularization yields a bias-variance trade-off, and propose an adaptive tuning strategy to optimize this trade-off. When the policy prior has control-theoretic stability guarantees, we further show that this regularization approximately preserves those stability guarantees throughout learning. We validate our approach empirically on a range of settings, and demonstrate significantly reduced variance, guaranteed dynamic stability, and more efficient learning than deep RL alone.
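To make the idea concrete, below is a minimal sketch of how functional regularization toward a control prior could look in practice. It assumes the regularized policy is a weighted combination of the learned action and the prior's action governed by a coefficient lambda, and that lambda is adapted with a simple TD-error-spread heuristic. The function names, the mixing formula, and the lambda update rule are illustrative assumptions, not the paper's exact algorithm.

    import numpy as np

    def regularized_action(state, rl_policy, control_prior, lam):
        # Blend the learned policy with the control prior in action (function) space.
        # lam = 0 recovers pure deep RL; large lam keeps behavior close to the prior.
        u_rl = rl_policy(state)
        u_prior = control_prior(state)
        return (u_rl + lam * u_prior) / (1.0 + lam)

    def adapt_lambda(lam, td_errors, target=1.0, rate=0.05):
        # Illustrative adaptive tuning of the bias-variance trade-off: increase lam
        # when the learning signal is noisy (large TD-error spread), decrease it as
        # the learned policy becomes more reliable.
        spread = float(np.std(td_errors))
        return max(0.0, lam + rate * (spread - target))

    # Usage sketch with placeholder policies on a 1-D control task.
    rl_policy = lambda s: np.tanh(s)        # stand-in for a deep policy network
    control_prior = lambda s: -0.5 * s      # stand-in for, e.g., an LQR/H-infinity prior
    lam = 5.0
    state = np.array([0.3])
    action = regularized_action(state, rl_policy, control_prior, lam)

Under this kind of mixing, a larger lambda biases behavior toward the prior (lower variance across runs, and behavior that stays near any stability certificate the prior carries), while lambda approaching zero recovers unregularized deep RL.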
