An Improved Convergence Analysis of Stochastic Variance-Reduced Policy Gradient

We revisit the stochastic variance-reduced policy gradient (SVRPG) method proposed by Papini et al. (2018) for reinforcement learning. We provide an improved convergence analysis of SVRPG and show that it can find an $\epsilon$-approximate stationary point of the performance function within $O(1/\epsilon^{5/3})$ trajectories. This sample complexity improves upon the best known result of $O(1/\epsilon^2)$ by a factor of $O(1/\epsilon^{1/3})$. At the core of our analysis are (i) a tighter upper bound for the variance of the importance sampling weights, where we prove that this variance can be controlled by the parameter distance between different policies; and (ii) a fine-grained analysis of the epoch length and batch size parameters, which allows us to significantly reduce the number of trajectories required in each iteration of SVRPG. We also empirically validate our theoretical claims about batch sizes on reinforcement learning benchmark tasks.
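To make the object of the analysis concrete, the following display sketches the SVRPG update as we read it from Papini et al. (2018); the symbols ($N$, $B$, $\eta$, the snapshot parameter $\tilde{\theta}^{s}$, the policy-gradient estimator $g$, and the importance weight $\omega$) follow that paper's construction and are not fixed by this abstract. At the start of epoch $s$, a snapshot gradient is formed from $N$ trajectories $\{\tilde{\tau}_j\}$ sampled under $\tilde{\theta}^{s} = \theta_0^{s}$; each inner iteration then samples $B$ trajectories $\{\tau_i\}$ under the current iterate $\theta_t^{s}$ and performs
$$
v_t^{s} \;=\; \frac{1}{N}\sum_{j=1}^{N} g\big(\tilde{\tau}_j \mid \tilde{\theta}^{s}\big) \;+\; \frac{1}{B}\sum_{i=1}^{B}\Big[\, g\big(\tau_i \mid \theta_t^{s}\big) \;-\; \omega\big(\tau_i \mid \theta_t^{s}, \tilde{\theta}^{s}\big)\, g\big(\tau_i \mid \tilde{\theta}^{s}\big) \Big], \qquad \theta_{t+1}^{s} \;=\; \theta_t^{s} + \eta\, v_t^{s},
$$
where $\omega(\tau \mid \theta, \tilde{\theta}) = p(\tau \mid \tilde{\theta})/p(\tau \mid \theta)$ corrects for the mismatch between the snapshot policy and the current policy. Point (i) above corresponds to a bound of the form $\mathrm{Var}\big[\omega(\tau \mid \theta, \tilde{\theta})\big] \le C\,\|\theta - \tilde{\theta}\|_2^2$ for a constant $C$ depending on the policy class, which ties the extra variance introduced by $\omega$ to the parameter distance between the current and snapshot policies.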

[1] Guy Lever, et al. Deterministic Policy Gradient Algorithms, 2014, ICML.

[2] Simon S. Du, et al. Stochastic Variance Reduction Methods for Policy Evaluation, 2017, ICML.

[3] Richard S. Sutton, et al. Reinforcement Learning: An Introduction, 1998, IEEE Trans. Neural Networks.

[4] Jie Liu, et al. SARAH: A Novel Method for Machine Learning Problems Using Stochastic Recursive Gradient, 2017, ICML.

[5] Yishay Mansour, et al. Policy Gradient Methods for Reinforcement Learning with Function Approximation, 1999, NIPS.

[6] Luca Bascetta, et al. Adaptive Step-Size for Policy Gradient Methods, 2013, NIPS.

[7] Jun S. Liu, et al. Monte Carlo strategies in scientific computing, 2001.

[8] A. Rényi. On Measures of Entropy and Information, 1961.

[9] Sergey Levine, et al. Trust Region Policy Optimization, 2015, ICML.

[10] Michael I. Jordan, et al. Non-convex Finite-Sum Optimization Via SCSG Methods, 2017, NIPS.

[11] Jian Peng, et al. Stochastic Variance Reduction for Policy Gradient Estimation, 2017, arXiv.

[12] Zeyuan Allen-Zhu, et al. Variance Reduction for Faster Non-Convex Optimization, 2016, ICML.

[13] Peter L. Bartlett, et al. Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning, 2001, J. Mach. Learn. Res.

[14] Shane Legg, et al. Human-level control through deep reinforcement learning, 2015, Nature.

[15] Demis Hassabis, et al. Mastering the game of Go with deep neural networks and tree search, 2016, Nature.

[16] Mark W. Schmidt, et al. Stop Wasting My Gradients: Practical SVRG, 2015, NIPS.

[17] Quanquan Gu, et al. Stochastic Nested Variance Reduced Gradient Descent for Nonconvex Optimization, 2018, NeurIPS.

[18] Amnon Shashua, et al. Safe, Multi-Agent, Reinforcement Learning for Autonomous Driving, 2016, arXiv.

[19] Sergey Levine, et al. The Mirage of Action-Dependent Baselines in Reinforcement Learning, 2018, ICML.

[20] Marcello Restelli, et al. Stochastic Variance-Reduced Policy Gradient, 2018, ICML.

[21] Nolan Wagener, et al. Learning contact-rich manipulation skills with guided policy search, 2015, IEEE International Conference on Robotics and Automation (ICRA).

[22] Sergey Levine, et al. Q-Prop: Sample-Efficient Policy Gradient with An Off-Policy Critic, 2016, ICLR.

[23] Lex Weaver, et al. The Optimal Reward Baseline for Gradient-Based Reinforcement Learning, 2001, UAI.

[24] Yishay Mansour, et al. Learning Bounds for Importance Weighting, 2010, NIPS.

[25] Pieter Abbeel, et al. Benchmarking Deep Reinforcement Learning for Continuous Control, 2016, ICML.

[26] Alexander J. Smola, et al. Stochastic Variance Reduction for Nonconvex Optimization, 2016, ICML.

[27] Tong Zhang, et al. Accelerating Stochastic Gradient Descent using Predictive Variance Reduction, 2013, NIPS.

[28] Marcello Restelli, et al. Policy Optimization via Importance Sampling, 2018, NeurIPS.

[29] Frank Sehnke, et al. Parameter-exploring policy gradients, 2010, Neural Networks.

[30] Jian Li, et al. A Simple Proximal Stochastic Gradient Method for Nonsmooth Nonconvex Optimization, 2018, NeurIPS.

[31] Sham M. Kakade, et al. A Natural Policy Gradient, 2001, NIPS.

[32] Lin Xiao, et al. A Proximal Stochastic Gradient Method with Progressive Variance Reduction, 2014, SIAM J. Optim.

[33] Gang Niu, et al. Analysis and Improvement of Policy Gradient Estimation, 2011, NIPS.

[34] Stefan Schaal, et al. Reinforcement learning of motor skills with policy gradients, 2008, Neural Networks.

[35] Tong Zhang, et al. SPIDER: Near-Optimal Non-Convex Optimization via Stochastic Path Integrated Differential Estimator, 2018, NeurIPS.

[36] R. J. Williams. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning, 1992, Machine Learning.

[37] H. Robbins. A Stochastic Approximation Method, 1951.

[38] Marcello Restelli, et al. Adaptive Batch Size for Safe Policy Gradients, 2017, NIPS.