Variational Policy Gradient Method for Reinforcement Learning with General Utilities

In recent years, reinforcement learning (RL) systems with general goals beyond a cumulative sum of rewards have gained traction, such as in constrained problems, exploration, and learning from prior experience. In this paper, we consider policy optimization in Markov Decision Problems, where the objective is a general concave utility function of the state-action occupancy measure, which subsumes several of the aforementioned examples as special cases. Such generality invalidates the Bellman equation; since dynamic programming therefore no longer applies, we focus on direct policy search. Analogously to the Policy Gradient Theorem \cite{sutton2000policy} available for RL with cumulative rewards, we derive a new Variational Policy Gradient Theorem for RL with general utilities, which establishes that the parametrized policy gradient may be obtained as the solution of a stochastic saddle point problem involving the Fenchel dual of the utility function. We develop a variational Monte Carlo gradient estimation algorithm to compute the policy gradient based on sample paths. We prove that the variational policy gradient scheme converges globally to the optimal policy for the general objective, even though the optimization problem is nonconvex. We also establish its rate of convergence of the order $O(1/t)$ by exploiting the hidden convexity of the problem, and prove that it converges exponentially fast when the problem admits hidden strong convexity. Our analysis applies to the standard RL problem with cumulative rewards as a special case, in which case our result improves upon the available convergence rate.
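To make the saddle-point characterization above concrete, the following sketch records the Fenchel-duality identity it rests on. The symbols used here, $\lambda^{\pi_\theta}$ for the state-action occupancy measure of the parametrized policy $\pi_\theta$, $F$ for the concave utility, and $F^{*}$ for its (concave) Fenchel conjugate, are introduced only for illustration and are not fixed by the abstract itself:
\[
  \max_{\theta}\; F\big(\lambda^{\pi_\theta}\big),
  \qquad
  F(\lambda) \;=\; \inf_{z}\Big\{\langle z,\lambda\rangle - F^{*}(z)\Big\},
  \qquad
  F^{*}(z) \;=\; \inf_{\lambda}\Big\{\langle z,\lambda\rangle - F(\lambda)\Big\}.
\]
Under suitable regularity (e.g., a unique dual solution), an envelope-type argument then suggests
\[
  \nabla_{\theta} F\big(\lambda^{\pi_\theta}\big)
  \;=\; \nabla_{\theta}\,\big\langle z^{\star},\, \lambda^{\pi_\theta}\big\rangle,
  \qquad
  z^{\star} \in \operatorname*{arg\,min}_{z}\Big\{\big\langle z,\lambda^{\pi_\theta}\big\rangle - F^{*}(z)\Big\},
\]
i.e., since $\langle z^{\star},\lambda^{\pi_\theta}\rangle$ is the expected cumulative return of $\pi_\theta$ under the pseudo-reward $z^{\star}$, the general-utility policy gradient reduces to a classical cumulative-reward policy gradient for that pseudo-reward, which is what allows it to be estimated from sample paths by a Monte Carlo procedure of the kind described above.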

[1] Sergey Levine, et al. Trust Region Policy Optimization, 2015, ICML.

[2] Peter W. Glynn, et al. Probability Functional Descent: A Unifying Perspective on GANs, Variational Inference, and Reinforcement Learning, 2019, ICML.

[3] E. Altman. Constrained Markov Decision Processes, 1999.

[4] Alec Radford, et al. Proximal Policy Optimization Algorithms, 2017, ArXiv.

[5] Mehran Mesbahi, et al. LQR through the Lens of First Order Methods: Discrete-time Case, 2019, ArXiv.

[6] Gergely Neu, et al. Online learning in episodic Markovian decision processes by relative entropy policy search, 2013, NIPS.

[7] Vivek S. Borkar, et al. Actor-Critic-Type Learning Algorithms for Markov Decision Processes, 1999, SIAM J. Control. Optim.

[8] Alexandre B. Tsybakov, et al. Introduction to Nonparametric Estimation, 2008, Springer Series in Statistics.

[9] D. K. Smith, et al. Numerical Optimization, 2001, J. Oper. Res. Soc.

[10] Ying Huang, et al. On Finding Optimal Policies for Markov Decision Chains: A Unifying Framework for Mean-Variance-Tradeoffs, 1994, Math. Oper. Res.

[11] Ambuj Tewari, et al. Regularization Techniques for Learning with Matrices, 2009, J. Mach. Learn. Res.

[12] Sham M. Kakade, et al. Provably Efficient Maximum Entropy Exploration, 2018, ICML.

[13] Ronald J. Williams, et al. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning, 2004, Machine Learning.

[14] Zhaoran Wang, et al. Neural Policy Gradient Methods: Global Optimality and Rates of Convergence, 2019, ICLR.

[15] Luca Bascetta, et al. Adaptive Step-Size for Policy Gradient Methods, 2013, NIPS.

[16] John N. Tsitsiklis, et al. Actor-Critic Algorithms, 1999, NIPS.

[17] Sham M. Kakade, et al. Global Convergence of Policy Gradient Methods for the Linear Quadratic Regulator, 2018, ICML.

[18] John N. Tsitsiklis, et al. Mean-Variance Optimization in Markov Decision Processes, 2011, ICML.

[19] Sham M. Kakade, et al. On the Theory of Policy Gradient Methods: Optimality, Approximation, and Distribution Shift, 2019, J. Mach. Learn. Res.

[20] B. V. Dean, et al. Studies in Linear and Non-Linear Programming, 1959.

[21] Long Ji Lin, et al. Reinforcement Learning of Non-Markov Decision Processes, 1995, Artif. Intell.

[22] John Langford, et al. Approximately Optimal Approximate Reinforcement Learning, 2002, ICML.

[23] Yaoliang Yu, et al. A General Projection Property for Distribution Families, 2009, NIPS.

[24] Peter Dayan, et al. Q-learning, 1992, Machine Learning.

[25] Shalabh Bhatnagar, et al. Natural actor-critic algorithms, 2009, Autom.

[26] Richard S. Sutton, et al. Learning to predict by the methods of temporal differences, 1988, Machine Learning.

[27] Quanquan Gu, et al. Sample Efficient Policy Gradient Methods with Recursive Variance Reduction, 2020, ICLR.

[28] Dmitriy Drusvyatskiy, et al. Efficiency of minimizing compositions of convex functions and smooth maps, 2016, Math. Program.

[29] Sham M. Kakade, et al. A Natural Policy Gradient, 2001, NIPS.

[30] Alec Koppel, et al. Cautious Reinforcement Learning via Distributional Risk in the Dual Domain, 2020, IEEE Journal on Selected Areas in Information Theory.

[31] Brett Browning, et al. A survey of robot learning from demonstration, 2009, Robotics Auton. Syst.

[32] Dale Schuurmans, et al. On the Global Convergence Rates of Softmax Policy Gradient Methods, 2020, ICML.

[33] Marcello Restelli, et al. Stochastic Variance-Reduced Policy Gradient, 2018, ICML.

[34] Mengdi Wang, et al. Generalization Bounds for Stochastic Saddle Point Problems, 2020, AISTATS.

[35] Bo Dai, et al. Reinforcement Learning via Fenchel-Rockafellar Duality, 2020, ArXiv.

[36] H. Robbins. A Stochastic Approximation Method, 1951.

[37] Hao Zhu, et al. Global Convergence of Policy Gradient Methods to (Almost) Locally Optimal Policies, 2019, SIAM J. Control. Optim.

[38] V. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint, 2008.

[39] Pieter Abbeel, et al. Constrained Policy Optimization, 2017, ICML.

[40] Alexander Shapiro, et al. Lectures on Stochastic Programming: Modeling and Theory, 2009.

[41] Jalaj Bhandari, et al. Global Optimality Guarantees For Policy Gradient Methods, 2019, ArXiv.

[42] Qi Cai, et al. Neural Proximal/Trust Region Policy Optimization Attains Globally Optimal Policy, 2019, ArXiv.

[43] Martin L. Puterman, et al. Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1994.

[44] Lodewijk C. M. Kallenberg, et al. Survey of linear programming for standard and nonstandard Markovian control problems. Part I: Theory, 1994, Math. Methods Oper. Res.

[45] J. Kiefer, et al. Stochastic Estimation of the Maximum of a Regression Function, 1952.

[46] John S. Edwards, et al. Linear Programming and Finite Markovian Control Problems, 1983.

[47] Stefan Schaal, et al. Learning from Demonstration, 1996, NIPS.

[48] Sean P. Meyn, et al. Risk-Sensitive Optimal Control for Markov Decision Processes with Monotone Cost, 2002, Math. Oper. Res.

[49] L. Takács, et al. Non-Markovian Processes, 1966.

[50] Yishay Mansour, et al. Policy Gradient Methods for Reinforcement Learning with Function Approximation, 1999, NIPS.

[51] C. Derman, et al. Some Remarks on Finite Horizon Markovian Decision Models, 1965.

[52] Luca Bascetta, et al. Policy gradient in Lipschitz Markov Decision Processes, 2015, Machine Learning.

[53] Jerzy A. Filar, et al. Variance-Penalized Markov Decision Processes, 1989, Math. Oper. Res.

[54] Sham M. Kakade, et al. Optimality and Approximation with Policy Gradient Methods in Markov Decision Processes, 2019, COLT.