Policy Optimization with Stochastic Mirror Descent

Improving sample efficiency has been a longstanding goal in reinforcement learning. This paper proposes the VRMPO algorithm, a sample-efficient policy gradient method based on stochastic mirror descent. VRMPO employs a novel variance-reduced policy gradient estimator to improve sample efficiency. We prove that VRMPO needs only O(ε^{-3}) sample trajectories to reach an ε-approximate first-order stationary point, matching the best known sample complexity for policy optimization. Extensive empirical results demonstrate that VRMPO outperforms state-of-the-art policy gradient methods in various settings.
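
The abstract only describes the method at a high level; the snippet below is a minimal, self-contained sketch (not the paper's actual VRMPO implementation) of how a recursive, variance-reduced policy gradient estimator can be combined with a stochastic mirror descent update, on a toy softmax-policy bandit. The ℓ_p-norm mirror map, batch sizes, step size, and all function names are assumptions made for this illustration.

```python
# Minimal sketch: recursive variance-reduced policy gradient + stochastic
# mirror descent on a toy multi-armed bandit with a softmax policy.
# Illustrative only -- not the paper's VRMPO implementation; the mirror map,
# batch sizes, and step size are assumptions made for this example.
import numpy as np

rng = np.random.default_rng(0)
K = 5                                   # number of actions (arms)
true_means = rng.uniform(0.0, 1.0, K)   # unknown mean rewards

def policy(theta):
    """Softmax policy over the K actions."""
    z = np.exp(theta - theta.max())
    return z / z.sum()

def sample_batch(theta, n):
    """Draw n (action, reward) pairs from the current policy."""
    p = policy(theta)
    actions = rng.choice(K, size=n, p=p)
    rewards = true_means[actions] + 0.1 * rng.standard_normal(n)
    return actions, rewards

def score(theta, a):
    """Score function: grad_theta log pi_theta(a) = e_a - pi_theta."""
    g = -policy(theta)
    g[a] += 1.0
    return g

def pg_estimate(theta, actions, rewards):
    """Plain REINFORCE-style gradient estimate on a batch."""
    return np.mean([r * score(theta, a) for a, r in zip(actions, rewards)], axis=0)

def pg_estimate_is(theta_old, theta_new, actions, rewards):
    """Old-policy gradient re-evaluated on new-policy samples via importance weights."""
    p_old, p_new = policy(theta_old), policy(theta_new)
    grads = [(p_old[a] / p_new[a]) * r * score(theta_old, a)
             for a, r in zip(actions, rewards)]
    return np.mean(grads, axis=0)

# l_p-norm mirror map psi(x) = 0.5 * ||x||_p^2; its gradient and the gradient
# of its conjugate (the l_q map, 1/p + 1/q = 1) are inverse maps of each other.
P, Q = 1.5, 3.0
def grad_psi(x, p):
    n = np.linalg.norm(x, p)
    return np.zeros_like(x) if n == 0 else np.sign(x) * np.abs(x) ** (p - 1) / n ** (p - 2)

def mirror_step(theta, v, lr):
    """Ascend in the dual space, then map back to the primal space."""
    return grad_psi(grad_psi(theta, P) + lr * v, Q)

theta, theta_prev, v = np.zeros(K), np.zeros(K), np.zeros(K)
lr, big_batch, small_batch = 0.5, 100, 10
for t in range(200):
    if t % 10 == 0:      # periodically recompute a large-batch reference gradient
        acts, rews = sample_batch(theta, big_batch)
        v = pg_estimate(theta, acts, rews)
    else:                # recursive variance-reduced correction on a small batch
        acts, rews = sample_batch(theta, small_batch)
        v = pg_estimate(theta, acts, rews) - pg_estimate_is(theta_prev, theta, acts, rews) + v
    theta_prev = theta.copy()
    theta = mirror_step(theta, v, lr)   # maximize expected reward: ascend along v

print("learned policy   :", np.round(policy(theta), 3))
print("best arm (oracle):", int(true_means.argmax()))
```

With the Euclidean potential (p = q = 2) the mirror step reduces to an ordinary stochastic gradient update; a non-Euclidean potential such as the ℓ_p map above changes the update geometry, which is the role stochastic mirror descent plays in the abstract.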
