Policy Optimization with Stochastic Mirror Descent

Improving sample efficiency has been a longstanding goal in reinforcement learning. This paper proposes the VRMPO algorithm, a sample-efficient policy gradient method based on stochastic mirror descent. VRMPO employs a novel variance-reduced policy gradient estimator to improve sample efficiency. We prove that VRMPO needs only O(ε^{-3}) sample trajectories to reach an ε-approximate first-order stationary point, matching the best known sample complexity for policy optimization. Extensive empirical results demonstrate that VRMPO outperforms state-of-the-art policy gradient methods in various settings.
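
The abstract only describes the method at a high level; the snippet below is a minimal, self-contained sketch (not the paper's actual VRMPO implementation) of how a recursive, variance-reduced policy gradient estimator can be combined with a stochastic mirror descent update, on a toy softmax-policy bandit. The ℓ_p-norm mirror map, batch sizes, step size, and all function names are assumptions made for this illustration.

```python
# Minimal sketch: recursive variance-reduced policy gradient + stochastic
# mirror descent on a toy multi-armed bandit with a softmax policy.
# Illustrative only -- not the paper's VRMPO implementation; the mirror map,
# batch sizes, and step size are assumptions made for this example.
import numpy as np

rng = np.random.default_rng(0)
K = 5                                   # number of actions (arms)
true_means = rng.uniform(0.0, 1.0, K)   # unknown mean rewards

def policy(theta):
    """Softmax policy over the K actions."""
    z = np.exp(theta - theta.max())
    return z / z.sum()

def sample_batch(theta, n):
    """Draw n (action, reward) pairs from the current policy."""
    p = policy(theta)
    actions = rng.choice(K, size=n, p=p)
    rewards = true_means[actions] + 0.1 * rng.standard_normal(n)
    return actions, rewards

def score(theta, a):
    """Score function: grad_theta log pi_theta(a) = e_a - pi_theta."""
    g = -policy(theta)
    g[a] += 1.0
    return g

def pg_estimate(theta, actions, rewards):
    """Plain REINFORCE-style gradient estimate on a batch."""
    return np.mean([r * score(theta, a) for a, r in zip(actions, rewards)], axis=0)

def pg_estimate_is(theta_old, theta_new, actions, rewards):
    """Old-policy gradient re-evaluated on new-policy samples via importance weights."""
    p_old, p_new = policy(theta_old), policy(theta_new)
    grads = [(p_old[a] / p_new[a]) * r * score(theta_old, a)
             for a, r in zip(actions, rewards)]
    return np.mean(grads, axis=0)

# l_p-norm mirror map psi(x) = 0.5 * ||x||_p^2; its gradient and the gradient
# of its conjugate (the l_q map, 1/p + 1/q = 1) are inverse maps of each other.
P, Q = 1.5, 3.0
def grad_psi(x, p):
    n = np.linalg.norm(x, p)
    return np.zeros_like(x) if n == 0 else np.sign(x) * np.abs(x) ** (p - 1) / n ** (p - 2)

def mirror_step(theta, v, lr):
    """Ascend in the dual space, then map back to the primal space."""
    return grad_psi(grad_psi(theta, P) + lr * v, Q)

theta, theta_prev, v = np.zeros(K), np.zeros(K), np.zeros(K)
lr, big_batch, small_batch = 0.5, 100, 10
for t in range(200):
    if t % 10 == 0:      # periodically recompute a large-batch reference gradient
        acts, rews = sample_batch(theta, big_batch)
        v = pg_estimate(theta, acts, rews)
    else:                # recursive variance-reduced correction on a small batch
        acts, rews = sample_batch(theta, small_batch)
        v = pg_estimate(theta, acts, rews) - pg_estimate_is(theta_prev, theta, acts, rews) + v
    theta_prev = theta.copy()
    theta = mirror_step(theta, v, lr)   # maximize expected reward: ascend along v

print("learned policy   :", np.round(policy(theta), 3))
print("best arm (oracle):", int(true_means.argmax()))
```

With the Euclidean potential (p = q = 2) the mirror step reduces to an ordinary stochastic gradient update; a non-Euclidean potential such as the ℓ_p map above changes the update geometry, which is the role stochastic mirror descent plays in the abstract.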
