First-order Policy Optimization for Robust Markov Decision Process

We consider the problem of solving a robust Markov decision process (MDP), which involves a set of discounted, finite-state, finite-action MDPs with uncertain transition kernels. The goal of planning is to find a robust policy that optimizes the worst-case values against the transition uncertainties, and thus encompasses standard MDP planning as a special case. For (s, a)-rectangular uncertainty sets, we develop a policy-based first-order method, namely the robust policy mirror descent (RPMD), and establish O(log(1/ε)) and O(1/ε) iteration complexities for finding an ε-optimal policy, with two increasing-stepsize schemes. These convergence results of RPMD apply to any Bregman divergence, provided the policy space has bounded radius measured by the divergence centered at the initial policy. Moreover, when the Bregman divergence corresponds to the squared Euclidean distance, we establish an O(max{1/ε, 1/(ηε²)}) complexity of RPMD with any constant stepsize η. For a general class of Bregman divergences, a similar complexity is also established for RPMD with constant stepsizes, provided the uncertainty set satisfies a relative strong convexity condition. We further develop a stochastic variant of the robust policy mirror descent method, named SRPMD, for the setting where first-order information is available only through online interactions with the nominal environment. For general Bregman divergences, we establish O(1/ε²) and O(1/ε³) sample complexities with two increasing-stepsize schemes. For the Euclidean Bregman divergence, we establish an O(1/ε³) sample complexity with constant stepsizes. To the best of our knowledge, all the aforementioned results appear to be new for policy-based first-order methods applied to the robust MDP problem.
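As a reading aid, the following is a minimal LaTeX sketch of the robust objective and a prototypical RPMD step consistent with the description above; the notation (the uncertainty set 𝒫, the Bregman divergence D, the robust Q-function Q_rob) is ours, and the paper's exact formulation and normalization may differ.

```latex
% Robust MDP objective: maximize the worst-case discounted return over an
% (s,a)-rectangular uncertainty set \mathcal{P} = \bigotimes_{s,a} \mathcal{P}_{s,a}.
\max_{\pi} \; \min_{P \in \mathcal{P}} \;
  \mathbb{E}^{\pi, P}\Big[ \textstyle\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t) \Big]

% Prototypical RPMD step with stepsize \eta_k: a mirror-ascent update per state,
% driven by the robust (worst-case) Q-function of the current policy \pi_k.
\pi_{k+1}(\cdot \mid s) \in \operatorname*{arg\,max}_{p \in \Delta(\mathcal{A})}
  \Big\{ \eta_k \big\langle Q^{\pi_k}_{\mathrm{rob}}(s, \cdot),\, p \big\rangle
         - D\big(p,\, \pi_k(\cdot \mid s)\big) \Big\},
\qquad
Q^{\pi_k}_{\mathrm{rob}}(s,a) = r(s,a)
  + \gamma \min_{P_{s,a} \in \mathcal{P}_{s,a}}
    \mathbb{E}_{s' \sim P_{s,a}}\big[ V^{\pi_k}_{\mathrm{rob}}(s') \big]
```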

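For a concrete feel of the method, below is a small, self-contained Python sketch of an RPMD-style loop under simplifying assumptions of ours: rewards are maximized, the (s, a)-rectangular uncertainty set is a finite family of candidate kernels (so the inner worst case is a plain minimum), the Bregman divergence is the KL divergence (which gives the exponentiated closed-form update), and the robust Q-function is computed exactly by fixed-point iteration. Function and variable names are ours, not the paper's.

```python
import numpy as np


def robust_q(policy, r, kernels, gamma, iters=1000, tol=1e-10):
    """Exact robust policy evaluation for an (s, a)-rectangular uncertainty
    set given as a finite family of candidate kernels (a scenario set).

    policy:  (S, A) array of action probabilities per state.
    r:       (S, A) reward table.
    kernels: (K, S, A, S) array; kernels[:, s, a, :] are the K candidate
             next-state distributions forming the uncertainty set at (s, a).
    """
    S, A = r.shape
    v = np.zeros(S)
    for _ in range(iters):
        # Adversarial (worst-case) expected next value for every (s, a).
        worst_next = (kernels @ v).min(axis=0)                 # (S, A)
        q = r + gamma * worst_next
        v_new = (policy * q).sum(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            v = v_new
            break
        v = v_new
    return r + gamma * (kernels @ v).min(axis=0)


def rpmd_kl(r, kernels, gamma, stepsizes):
    """RPMD-style mirror ascent with the KL divergence: the per-state update
    has an exponentiated closed form, implemented in log space for stability."""
    S, A = r.shape
    log_policy = np.full((S, A), -np.log(A))                   # uniform start
    for eta in stepsizes:
        policy = np.exp(log_policy)
        q = robust_q(policy, r, kernels, gamma)
        logits = log_policy + eta * q                          # mirror step
        logits -= logits.max(axis=1, keepdims=True)            # stabilize
        log_policy = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return np.exp(log_policy)


# Toy usage on a randomly generated instance (all quantities hypothetical).
rng = np.random.default_rng(0)
S, A, K = 4, 3, 5
r = rng.random((S, A))
kernels = rng.dirichlet(np.ones(S), size=(K, S, A))            # (K, S, A, S)
pi = rpmd_kl(r, kernels, gamma=0.9,
             stepsizes=[0.5 * 1.05 ** k for k in range(200)])
print(np.round(pi, 3))
```

The geometrically growing stepsizes stand in for the increasing-stepsize schemes mentioned in the abstract; replacing the exact robust evaluation in robust_q with estimates built from online interactions with the nominal environment would correspond, roughly, to the SRPMD setting.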