论文信息 - Stochastic Variance-Reduced Policy Gradient

Stochastic Variance-Reduced Policy Gradient

In this paper, we propose a novel reinforcement- learning algorithm consisting in a stochastic variance-reduced version of policy gradient for solving Markov Decision Processes (MDPs). Stochastic variance-reduced gradient (SVRG) methods have proven to be very successful in supervised learning. However, their adaptation to policy gradient is not straightforward and needs to account for I) a non-concave objective func- tion; II) approximations in the full gradient com- putation; and III) a non-stationary sampling pro- cess. The result is SVRPG, a stochastic variance- reduced policy gradient algorithm that leverages on importance weights to preserve the unbiased- ness of the gradient estimate. Under standard as- sumptions on the MDP, we provide convergence guarantees for SVRPG with a convergence rate that is linear under increasing batch sizes. Finally, we suggest practical variants of SVRPG, and we empirically evaluate them on continuous MDPs.

[1] Mark W. Schmidt,et al. StopWasting My Gradients: Practical SVRG , 2015, NIPS.

[2] Francis Bach,et al. SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives , 2014, NIPS.

[3] Julien Mairal,et al. Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure , 2016, NIPS.

[4] Peter L. Bartlett,et al. Infinite-Horizon Policy-Gradient Estimation , 2001, J. Artif. Intell. Res..

[5] Reuven Y. Rubinstein,et al. Simulation and the Monte Carlo method , 1981, Wiley series in probability and mathematical statistics.

[6] Lihong Li,et al. Stochastic Variance Reduction Methods for Policy Evaluation , 2017, ICML.

[7] Jian Peng,et al. Stochastic Variance Reduction for Policy Gradient Estimation , 2017, ArXiv.

[8] David Barber,et al. A Unifying Perspective of Parametric Policy Search Methods for Markov Decision Processes , 2012, NIPS.

[9] Sham M. Kakade,et al. A Natural Policy Gradient , 2001, NIPS.

[10] Zeyuan Allen Zhu,et al. Variance Reduction for Faster Non-Convex Optimization , 2016, ICML.

[11] Francis R. Bach,et al. Stochastic Variance Reduction Methods for Saddle-Point Problems , 2016, NIPS.

[12] John Darzentas,et al. Problem Complexity and Method Efficiency in Optimization , 1983 .

[13] Yishay Mansour,et al. Policy Gradient Methods for Reinforcement Learning with Function Approximation , 1999, NIPS.

[14] Lex Weaver,et al. The Optimal Reward Baseline for Gradient-Based Reinforcement Learning , 2001, UAI.

[15] Yishay Mansour,et al. Learning Bounds for Importance Weighting , 2010, NIPS.

[16] Doina Precup,et al. Eligibility Traces for Off-Policy Policy Evaluation , 2000, ICML.

[17] Yurii Nesterov,et al. Introductory Lectures on Convex Optimization - A Basic Course , 2014, Applied Optimization.

[18] Luca Bascetta,et al. Adaptive Step-Size for Policy Gradient Methods , 2013, NIPS.

[19] Richard S. Sutton,et al. Reinforcement Learning: An Introduction , 1998, IEEE Trans. Neural Networks.

[20] Philip S. Thomas,et al. Policy Gradient Methods for Reinforcement Learning with Function Approximation and Action-Dependent Baselines , 2017, ArXiv.

[21] H. Robbins. A Stochastic Approximation Method , 1951 .

[22] Pieter Abbeel,et al. Benchmarking Deep Reinforcement Learning for Continuous Control , 2016, ICML.

[23] Jie Liu,et al. Mini-Batch Semi-Stochastic Gradient Descent in the Proximal Setting , 2015, IEEE Journal of Selected Topics in Signal Processing.

[24] Ronald J. Williams,et al. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning , 2004, Machine Learning.

[25] Yann LeCun,et al. Large Scale Online Learning , 2003, NIPS.

[26] Mark W. Schmidt,et al. A Stochastic Gradient Method with an Exponential Convergence Rate for Finite Training Sets , 2012, NIPS.

[27] Alexander J. Smola,et al. Stochastic Variance Reduction for Nonconvex Optimization , 2016, ICML.

[28] Marcello Restelli,et al. Adaptive Batch Size for Safe Policy Gradients , 2017, NIPS.

[29] Masashi Sugiyama,et al. Analysis and Improvement of Policy Gradient Estimation , 2011 .

[30] Saeed Ghadimi,et al. Stochastic First- and Zeroth-Order Methods for Nonconvex Stochastic Programming , 2013, SIAM J. Optim..

[31] Luca Bascetta,et al. Policy gradient in Lipschitz Markov Decision Processes , 2015, Machine Learning.

[32] Jimmy Ba,et al. Adam: A Method for Stochastic Optimization , 2014, ICLR.

[33] Gang Niu,et al. Analysis and Improvement of Policy Gradient Estimation , 2011, NIPS.

[34] Stefan Schaal,et al. 2008 Special Issue: Reinforcement learning of motor skills with policy gradients , 2008 .

[35] Julien Mairal,et al. Incremental Majorization-Minimization Optimization with Application to Large-Scale Machine Learning , 2014, SIAM J. Optim..

[36] Justin Domke,et al. Finito: A faster, permutable incremental gradient method for big data problems , 2014, ICML.

[37] Tong Zhang,et al. Accelerating Stochastic Gradient Descent using Predictive Variance Reduction , 2013, NIPS.

[38] Alexander J. Smola,et al. Fast Incremental Method for Nonconvex Optimization , 2016, ArXiv.

[39] Yuval Tassa,et al. MuJoCo: A physics engine for model-based control , 2012, 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems.

[40] Filip Jurčı́ček,et al. Reinforcement learning for spoken dialogue systems using off-policy natural gradient method , 2012, 2012 IEEE Spoken Language Technology Workshop (SLT).