Policy Mirror Descent for Reinforcement Learning: Linear Convergence, New Sampling Complexity, and Generalized Problem Classes

We present new policy mirror descent (PMD) methods for solving reinforcement learning (RL) problems with either strongly convex or general convex regularizers. By exploring the structural properties of these overall highly nonconvex problems, we show that the PMD methods exhibit a fast linear rate of convergence to global optimality. We develop stochastic counterparts of these methods and establish an O(1/ε) (resp., O(1/ε²)) sampling complexity for solving these RL problems with strongly (resp., general) convex regularizers using different sampling schemes, where ε denotes the target accuracy. We further show that the complexity for computing the gradients of these regularizers, if necessary, can be bounded by O{(log_γ ε)[(1−γ)L/μ] log(1/ε)} (resp., O{(log_γ ε)(L/ε)}) for problems with strongly (resp., general) convex regularizers. Here γ denotes the discount factor. To the best of our knowledge, these complexity bounds, along with our algorithmic developments, appear to be new in both the optimization and RL literature. The introduction of these convex regularizers also greatly enhances the flexibility and thus expands the applicability of RL models.
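
To make the PMD update concrete, the following is a minimal sketch of a single deterministic PMD step on a tabular MDP, assuming the KL divergence as the Bregman distance and no regularizer (h = 0), in which case the prox step has the well-known closed-form multiplicative update π_{k+1}(a|s) ∝ π_k(a|s) exp(−η Q^{π_k}(s,a)). The toy MDP, function names, and step size η are illustrative assumptions; the paper's stochastic variants, general Bregman distances, and convex regularizers are not shown here.

```python
# Minimal illustrative sketch (not the paper's exact algorithm): one exact PMD
# step on a toy tabular MDP with KL Bregman distance and no regularizer.
# Costs r are minimized, matching the minimization formulation of the problem.
import numpy as np

def policy_evaluation(P, r, gamma, pi):
    """Solve (I - gamma * P_pi) V = r_pi exactly for the value of policy pi."""
    n_s, _ = r.shape
    P_pi = np.einsum("sa,sat->st", pi, P)   # state-to-state kernel under pi
    r_pi = np.einsum("sa,sa->s", pi, r)     # expected one-step cost under pi
    V = np.linalg.solve(np.eye(n_s) - gamma * P_pi, r_pi)
    Q = r + gamma * np.einsum("sat,t->sa", P, V)
    return V, Q

def pmd_step(P, r, gamma, pi, eta):
    """One PMD update; with KL divergence the prox step is a closed-form
    multiplicative update: pi_new(a|s) ∝ pi(a|s) * exp(-eta * Q(s, a))."""
    _, Q = policy_evaluation(P, r, gamma, pi)
    logits = np.log(pi) - eta * Q           # mirror step in the dual (log) space
    logits -= logits.max(axis=1, keepdims=True)   # for numerical stability
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum(axis=1, keepdims=True)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_s, n_a, gamma = 4, 3, 0.9
    P = rng.random((n_s, n_a, n_s)); P /= P.sum(axis=2, keepdims=True)
    r = rng.random((n_s, n_a))              # costs to be minimized
    pi = np.full((n_s, n_a), 1.0 / n_a)     # start from the uniform policy
    for _ in range(50):
        pi = pmd_step(P, r, gamma, pi, eta=1.0)
    print(np.round(pi, 3))
```

With a strongly convex regularizer (e.g., negative entropy) or a general Bregman distance, the prox step no longer has this closed form and is instead solved as a small convex subproblem per state; replacing the exact Q^{π_k} above with sampled estimates is what drives the sampling-complexity bounds stated in the abstract.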
