Policy Mirror Descent for Reinforcement Learning: Linear Convergence, New Sampling Complexity, and Generalized Problem Classes

We present new policy mirror descent (PMD) methods for solving reinforcement learning (RL) problems with either strongly convex or general convex regularizers. By exploring the structural properties of these overall highly nonconvex problems, we show that the PMD methods exhibit a fast linear rate of convergence to global optimality. We develop stochastic counterparts of these methods and establish an O(1/ε) (resp., O(1/ε²)) sampling complexity for solving these RL problems with strongly (resp., general) convex regularizers using different sampling schemes, where ε denotes the target accuracy. We further show that the complexity for computing the gradients of these regularizers, if necessary, can be bounded by O{(log_γ ε)[(1−γ)L/μ] log(1/ε)} (resp., O{(log_γ ε)(L/ε)}) for problems with strongly (resp., general) convex regularizers. Here γ denotes the discount factor. To the best of our knowledge, these complexity bounds, along with our algorithmic developments, appear to be new in both the optimization and RL literature. The introduction of these convex regularizers also greatly enhances the flexibility and thus expands the applicability of RL models.
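
To make the PMD update concrete, the following is a minimal sketch of a single deterministic PMD step on a tabular MDP, assuming the KL divergence as the Bregman distance and no regularizer (h = 0), in which case the prox step has the well-known closed-form multiplicative update π_{k+1}(a|s) ∝ π_k(a|s) exp(−η Q^{π_k}(s,a)). The toy MDP, function names, and step size η are illustrative assumptions; the paper's stochastic variants, general Bregman distances, and convex regularizers are not shown here.

```python
# Minimal illustrative sketch (not the paper's exact algorithm): one exact PMD
# step on a toy tabular MDP with KL Bregman distance and no regularizer.
# Costs r are minimized, matching the minimization formulation of the problem.
import numpy as np

def policy_evaluation(P, r, gamma, pi):
    """Solve (I - gamma * P_pi) V = r_pi exactly for the value of policy pi."""
    n_s, _ = r.shape
    P_pi = np.einsum("sa,sat->st", pi, P)   # state-to-state kernel under pi
    r_pi = np.einsum("sa,sa->s", pi, r)     # expected one-step cost under pi
    V = np.linalg.solve(np.eye(n_s) - gamma * P_pi, r_pi)
    Q = r + gamma * np.einsum("sat,t->sa", P, V)
    return V, Q

def pmd_step(P, r, gamma, pi, eta):
    """One PMD update; with KL divergence the prox step is a closed-form
    multiplicative update: pi_new(a|s) ∝ pi(a|s) * exp(-eta * Q(s, a))."""
    _, Q = policy_evaluation(P, r, gamma, pi)
    logits = np.log(pi) - eta * Q           # mirror step in the dual (log) space
    logits -= logits.max(axis=1, keepdims=True)   # for numerical stability
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum(axis=1, keepdims=True)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_s, n_a, gamma = 4, 3, 0.9
    P = rng.random((n_s, n_a, n_s)); P /= P.sum(axis=2, keepdims=True)
    r = rng.random((n_s, n_a))              # costs to be minimized
    pi = np.full((n_s, n_a), 1.0 / n_a)     # start from the uniform policy
    for _ in range(50):
        pi = pmd_step(P, r, gamma, pi, eta=1.0)
    print(np.round(pi, 3))
```

With a strongly convex regularizer (e.g., negative entropy) or a general Bregman distance, the prox step no longer has this closed form and is instead solved as a small convex subproblem per state; replacing the exact Q^{π_k} above with sampled estimates is what drives the sampling-complexity bounds stated in the abstract.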
