Bregman Gradient Policy Optimization

In this paper, we design a novel Bregman gradient policy optimization framework for reinforcement learning based on Bregman divergences and momentum techniques. Specifically, we propose a Bregman gradient policy optimization (BGPO) algorithm built on the basic momentum technique and mirror descent iteration, and an accelerated variant (VR-BGPO) built on a momentum-based variance-reduction technique. Moreover, we introduce a convergence analysis framework for Bregman gradient policy optimization in the nonconvex setting. We prove that BGPO achieves a sample complexity of Õ(ε⁻⁴) for finding an ε-stationary point while requiring only one trajectory per iteration, and that VR-BGPO reaches the best-known sample complexity of Õ(ε⁻³) for finding an ε-stationary point, also requiring only one trajectory per iteration. In particular, by choosing different Bregman divergences, our methods unify many existing policy optimization algorithms and their new variants, such as the existing (variance-reduced) policy gradient algorithms and (variance-reduced) natural policy gradient algorithms. Extensive experimental results on multiple reinforcement learning tasks demonstrate the efficiency of our new algorithms.
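
To make the high-level description concrete, the sketch below illustrates one BGPO-style iteration under the simplest choice of Bregman divergence, the squared Euclidean distance, for which the mirror descent subproblem has a closed-form solution. This is a minimal sketch under stated assumptions, not the paper's exact algorithm: the estimator `grad_fn`, the step sizes, and the parameter names are hypothetical placeholders.

```python
import numpy as np

def bgpo_style_step(theta, traj, grad_fn, m_prev, beta=0.9, eta=1e-2, lam=1.0):
    """One momentum mirror-descent (Bregman) policy update (illustrative sketch).

    theta   : current policy parameters (np.ndarray)
    traj    : a single sampled trajectory (one trajectory per iteration)
    grad_fn : stochastic policy-gradient estimator, grad_fn(theta, traj) -> np.ndarray
    m_prev  : previous momentum estimate of the policy gradient
    beta    : momentum weight
    eta     : step size
    lam     : weight on the Bregman term (here the squared Euclidean distance)
    """
    g = grad_fn(theta, traj)                 # stochastic gradient from one trajectory
    m = beta * m_prev + (1.0 - beta) * g     # basic momentum estimate of the gradient

    # Mirror-descent step: argmax_x <m, x> - (lam / (2 * eta)) * ||x - theta||^2.
    # With the squared-Euclidean Bregman divergence this reduces to a gradient-ascent step;
    # other Bregman divergences would yield other (e.g., natural-gradient-like) updates.
    theta_new = theta + (eta / lam) * m
    return theta_new, m
```

Swapping the squared Euclidean distance for another Bregman divergence changes only the subproblem solved in the last step, which is how the framework recovers different (variance-reduced) policy gradient and natural policy gradient variants.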
