Near Optimal Policy Optimization via REPS

Since its introduction a decade ago, relative entropy policy search (REPS) has demonstrated successful policy learning on a number of simulated and real-world robotic domains, not to mention providing algorithmic components used by many recently proposed reinforcement learning (RL) algorithms. While REPS is commonly known in the community, there exist no guarantees on its performance when using stochastic and gradient-based solvers. In this paper we aim to fill this gap by providing guarantees and convergence rates for the sub-optimality of a policy learned using first-order optimization methods applied to the REPS objective. We first consider the setting in which we are given access to exact gradients and demonstrate how near-optimality of the objective translates to near-optimality of the policy. We then consider the practical setting of stochastic gradients, and introduce a technique that uses generative access to the underlying Markov decision process to compute parameter updates that maintain favorable convergence to the optimal regularized policy.
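To make the optimized objective concrete, the sketch below runs plain (stochastic) gradient descent on the standard KL-regularized REPS dual for a toy tabular MDP, using a generative model to sample next states. Everything in it, the MDP, the temperature eta, the step size, and the plug-in gradient estimator, is an illustrative assumption rather than the paper's construction; in particular, the plug-in estimator is biased, which is precisely the kind of difficulty the stochastic-gradient setting has to address.

```python
# A minimal, self-contained sketch (not the paper's algorithm): plain gradient
# descent on the KL-regularized REPS dual of a small toy tabular MDP, with
# next-state expectations replaced by samples from a generative model.
# The MDP, the plug-in gradient estimator, and all hyperparameters below are
# illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Toy tabular MDP with generative-model access.
S, A, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a, s'] = P(s' | s, a)
R = rng.uniform(size=(S, A))                 # rewards in [0, 1]
nu0 = np.full(S, 1.0 / S)                    # initial-state distribution
mu = np.full((S, A), 1.0 / (S * A))          # reference state-action distribution

eta = 0.1        # temperature of the relative-entropy regularizer
lr = 0.05        # step size
n_samples = 32   # generative-model samples per (s, a) per iteration
T = 2000         # gradient-descent iterations


def sample_next_states(s, a, n):
    """Generative model: draw n i.i.d. next states from P(. | s, a)."""
    return rng.choice(S, size=n, p=P[s, a])


def dual_gradient(V):
    """Plug-in stochastic gradient of the REPS dual
        g(V) = (1-gamma) <nu0, V> + eta * log sum_{s,a} mu(s,a) exp(delta(s,a)/eta),
    where delta(s,a) = R(s,a) + gamma * E[V(s')] - V(s).  Replacing E[V(s')]
    by a sample mean inside exp makes this estimate biased; handling that bias
    is the core issue in the stochastic-gradient setting.
    """
    EV = np.empty((S, A))
    for s in range(S):
        for a in range(A):
            EV[s, a] = V[sample_next_states(s, a, n_samples)].mean()
    delta = R + gamma * EV - V[:, None]
    w = mu * np.exp((delta - delta.max()) / eta)   # softmax weights, max-shifted for stability
    w /= w.sum()
    grad = (1.0 - gamma) * nu0
    for s_bar in range(S):
        grad[s_bar] -= w[s_bar].sum()              # -1{s = s_bar} term
        for a in range(A):
            grad[s_bar] += gamma * w[:, a] @ P[:, a, s_bar]  # +gamma * P(s_bar | s, a) term
    return grad, w


V = np.zeros(S)
for t in range(T):
    grad, w = dual_gradient(V)
    V -= lr * grad          # descend the convex dual in V

# The (regularized-)optimal occupancy is proportional to mu * exp(delta / eta);
# normalizing the final weights per state gives the corresponding policy.
policy = w / w.sum(axis=1, keepdims=True)
print(np.round(policy, 3))
```

With the exact next-state expectations substituted for the sample means, the same loop corresponds to the exact-gradient setting considered first in the paper; the sampled version illustrates why additional care is needed before near-optimality of the dual objective can be carried over to the learned policy.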
