Control Regularization for Reduced Variance Reinforcement Learning

Dealing with high variance is a significant challenge in model-free reinforcement learning (RL). Existing methods are unreliable, exhibiting high variance in performance from run to run using different initializations/seeds. Focusing on problems arising in continuous control, we propose a functional regularization approach to augmenting model-free RL. In particular, we regularize the behavior of the deep policy to be similar to a policy prior, i.e., we regularize in function space. We show that functional regularization yields a bias-variance trade-off, and propose an adaptive tuning strategy to optimize this trade-off. When the policy prior has control-theoretic stability guarantees, we further show that this regularization approximately preserves those stability guarantees throughout learning. We validate our approach empirically on a range of settings, and demonstrate significantly reduced variance, guaranteed dynamic stability, and more efficient learning than deep RL alone.
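To make the idea concrete, below is a minimal sketch of how functional regularization toward a control prior could look in practice. It assumes the regularized policy is a weighted combination of the learned action and the prior's action governed by a coefficient lambda, and that lambda is adapted with a simple TD-error-spread heuristic. The function names, the mixing formula, and the lambda update rule are illustrative assumptions, not the paper's exact algorithm.

    import numpy as np

    def regularized_action(state, rl_policy, control_prior, lam):
        # Blend the learned policy with the control prior in action (function) space.
        # lam = 0 recovers pure deep RL; large lam keeps behavior close to the prior.
        u_rl = rl_policy(state)
        u_prior = control_prior(state)
        return (u_rl + lam * u_prior) / (1.0 + lam)

    def adapt_lambda(lam, td_errors, target=1.0, rate=0.05):
        # Illustrative adaptive tuning of the bias-variance trade-off: increase lam
        # when the learning signal is noisy (large TD-error spread), decrease it as
        # the learned policy becomes more reliable.
        spread = float(np.std(td_errors))
        return max(0.0, lam + rate * (spread - target))

    # Usage sketch with placeholder policies on a 1-D control task.
    rl_policy = lambda s: np.tanh(s)        # stand-in for a deep policy network
    control_prior = lambda s: -0.5 * s      # stand-in for, e.g., an LQR/H-infinity prior
    lam = 5.0
    state = np.array([0.3])
    action = regularized_action(state, rl_policy, control_prior, lam)

Under this kind of mixing, a larger lambda biases behavior toward the prior (lower variance across runs, and behavior that stays near any stability certificate the prior carries), while lambda approaching zero recovers unregularized deep RL.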
