On the Global Convergence Rates of Softmax Policy Gradient Methods

We make three contributions toward a better understanding of policy gradient methods in the tabular setting. First, we show that with the true gradient, policy gradient with a softmax parametrization converges at a $O(1/t)$ rate, with constants depending on the problem and initialization. This result significantly strengthens recent asymptotic convergence results. The analysis relies on two findings: that the softmax policy gradient satisfies a Łojasiewicz inequality, and that the minimum probability of an optimal action during optimization can be bounded in terms of its initial value. Second, we analyze entropy regularized policy gradient and show that it enjoys a significantly faster linear convergence rate $O(e^{-t})$ toward the softmax optimal policy. This result resolves an open question in the recent literature. Finally, combining the above two results with additional new $\Omega(1/t)$ lower bound results, we explain how entropy regularization improves policy optimization, even with the true gradient, from the perspective of the convergence rate. The separation of rates is further explained using the notion of non-uniform Łojasiewicz degree. These results provide a theoretical understanding of the impact of entropy and corroborate existing empirical studies.
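In the tabular (one-state, i.e. bandit) case the true gradient of the softmax policy is available in closed form, so the two rates described above can be illustrated numerically. The sketch below is not the authors' code: the reward vector, step size 0.4, temperature 0.1, and number of steps are arbitrary choices made purely for illustration.

```python
# Minimal sketch: exact (true-gradient) softmax policy gradient on a K-armed
# bandit, with and without entropy regularization. All constants are assumptions.
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def true_gradient(theta, r, tau):
    # Closed-form gradient of  pi . r + tau * H(pi)  w.r.t. theta,
    # where pi = softmax(theta) and H is the Shannon entropy.
    pi = softmax(theta)
    r_aug = r if tau == 0.0 else r - tau * np.log(pi)  # entropy-augmented reward
    return pi * (r_aug - pi @ r_aug)

def run(r, tau=0.0, eta=0.4, steps=2000):
    theta = np.zeros_like(r)  # uniform initial policy
    gaps = []
    for _ in range(steps):
        theta += eta * true_gradient(theta, r, tau)
        gaps.append(r.max() - softmax(theta) @ r)  # sub-optimality gap of current policy
    return np.array(gaps)

r = np.array([1.0, 0.8, 0.5])
vanilla = run(r, tau=0.0)      # gap shrinks roughly like O(1/t)
regularized = run(r, tau=0.1)  # policy converges linearly, to the softmax optimal policy
print(vanilla[::500])
print(regularized[::500])
```

Note that with $\tau > 0$ the printed gap settles at the (positive) sub-optimality of the softmax optimal policy rather than at zero; the linear rate refers to convergence toward that regularized optimum.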
