A Hybrid Stochastic Policy Gradient Algorithm for Reinforcement Learning

We propose a novel hybrid stochastic policy gradient estimator that combines an unbiased estimator, the REINFORCE estimator, with a biased one, an adapted SARAH estimator, for policy optimization. The hybrid estimator is shown to be biased but to possess a variance-reduction property. Using this estimator, we develop a new Proximal Hybrid Stochastic Policy Gradient Algorithm (ProxHSPGA) to solve a composite policy optimization problem, which allows us to handle constraints or regularizers on the policy parameters. We first propose a single-loop algorithm and then introduce a more practical restarting variant. We prove that both algorithms achieve the best-known trajectory complexity $\mathcal{O}\left(\varepsilon^{-3}\right)$ to attain a first-order stationary point of the composite problem, which improves on the $\mathcal{O}\left(\varepsilon^{-4}\right)$ complexity of REINFORCE/GPOMDP and the $\mathcal{O}\left(\varepsilon^{-10/3}\right)$ complexity of SVRPG in the non-composite setting. We evaluate our algorithm on several well-known reinforcement learning benchmarks. Numerical results show that it outperforms two existing methods on these examples. Moreover, the composite setting indeed offers advantages over the non-composite one on certain problems.
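To make the setting concrete, the following is a hedged sketch (our notation, not necessarily the paper's) of the composite problem and the hybrid estimator. The composite policy optimization problem augments the expected-return objective $J(\theta)$ with a possibly nonsmooth convex regularizer $\psi(\theta)$ that encodes constraints or structure on the policy parameters:

$$\max_{\theta \in \mathbb{R}^d} \Big\{ J(\theta) - \psi(\theta) \Big\}, \qquad J(\theta) := \mathbb{E}_{\tau \sim p(\cdot \mid \theta)}\big[\mathcal{R}(\tau)\big].$$

The hybrid estimator $v_t$ blends an unbiased REINFORCE term $u_t$ with a SARAH-style recursive difference; an importance weight $\omega(\tau \mid \theta_{t-1}, \theta_t)$ corrects for the fact that the trajectories $\tau \in \mathcal{B}_t$ are sampled under the current policy $\theta_t$ rather than $\theta_{t-1}$:

$$v_t = (1-\beta)\, u_t + \beta \Big( v_{t-1} + \frac{1}{b} \sum_{\tau \in \mathcal{B}_t} \big[ \hat{\nabla} J(\theta_t; \tau) - \omega(\tau \mid \theta_{t-1}, \theta_t)\, \hat{\nabla} J(\theta_{t-1}; \tau) \big] \Big), \qquad \beta \in [0,1].$$

Setting $\beta = 0$ recovers plain REINFORCE, while $\beta = 1$ yields a purely recursive SARAH-type estimator; intermediate values trade a small bias for reduced variance.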

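For illustration only, a minimal Python sketch of one single-loop update follows, assuming hypothetical helpers reinforce_grad (per-trajectory REINFORCE gradient) and importance_weight (likelihood ratio between the two policies), and using an $\ell_2$ regularizer so the proximal step has a closed form; none of these names come from the paper.

def prox_l2(theta, eta, lam):
    # Proximal operator of psi(theta) = (lam / 2) * ||theta||^2;
    # the composite framework allows any convex psi with a tractable prox.
    return theta / (1.0 + eta * lam)

def hybrid_spg_step(theta_prev, theta, v_prev, batch, beta, eta, lam,
                    reinforce_grad, importance_weight):
    # One hedged ProxHSPGA-style update (a sketch, not the authors' code).
    # batch: trajectories sampled under the current policy theta.
    b = len(batch)
    # Unbiased REINFORCE term at the current iterate.
    u = sum(reinforce_grad(theta, tau) for tau in batch) / b
    # SARAH-style recursive difference; the importance weight re-weights
    # trajectories sampled under theta to estimate the gradient at theta_prev.
    diff = sum(reinforce_grad(theta, tau)
               - importance_weight(tau, theta_prev, theta) * reinforce_grad(theta_prev, tau)
               for tau in batch) / b
    # Hybrid estimator: convex combination of unbiased and recursive terms.
    v = (1.0 - beta) * u + beta * (v_prev + diff)
    # Proximal gradient ascent step on the composite objective J - psi.
    theta_next = prox_l2(theta + eta * v, eta, lam)
    return theta_next, v

The restarting variant described in the abstract would periodically re-initialize $v_t$ (for example, with a larger-batch REINFORCE estimate) to keep the accumulated bias of the recursive term in check.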