Adaptive Trust Region Policy Optimization: Global Convergence and Faster Rates for Regularized MDPs

Trust region policy optimization (TRPO) is a popular and empirically successful policy search algorithm in Reinforcement Learning (RL), in which a surrogate problem that restricts consecutive policies to be ‘close’ to one another is solved iteratively. Nevertheless, TRPO has been considered a heuristic algorithm inspired by Conservative Policy Iteration (CPI). We show that the adaptive scaling mechanism used in TRPO is in fact the natural “RL version” of traditional trust-region methods from convex analysis. We first analyze TRPO in the planning setting, in which we have access to the model and the entire state space. Then, we consider sample-based TRPO and establish an Õ(1/√N) convergence rate to the global optimum. Importantly, the adaptive scaling mechanism allows us to analyze TRPO in regularized MDPs, for which we prove fast rates of Õ(1/N), much like results in convex optimization. This is the first result in RL showing that regularizing the instantaneous cost or reward leads to faster rates.
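For intuition, the adaptive mechanism described above can be read as a KL-proximal (mirror-descent-style) step performed state by state. The display below is a minimal sketch in our own notation, not the paper's exact statement: t_k denotes an adaptive step size, Q^{π_k} the state-action value of the current policy, and Δ_A the probability simplex over actions; the regularized variant adds a convex penalty (e.g., negative entropy) to the instantaneous cost, which is what enables the faster Õ(1/N) rate.

% Schematic per-state, KL-proximal view of the adaptive update
% (a sketch; t_k, Q^{\pi_k}, and the exact objective are illustrative
% assumptions, not the paper's precise formulation).
\[
  \pi_{k+1}(\cdot \mid s) \;\in\;
  \operatorname*{arg\,max}_{\pi(\cdot \mid s) \,\in\, \Delta_A}
  \Big\{ t_k \,\big\langle Q^{\pi_k}(s,\cdot),\, \pi(\cdot \mid s) \big\rangle
  \;-\; D_{\mathrm{KL}}\!\big(\pi(\cdot \mid s)\,\big\|\,\pi_k(\cdot \mid s)\big) \Big\}.
\]

Read this way, the adaptive scaling plays the role of the step size in classical mirror descent, which is the sense in which the abstract calls it the natural “RL version” of trust-region methods from convex analysis.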
