Optimality and Approximation with Policy Gradient Methods in Markov Decision Processes

Policy gradient methods are among the most effective methods in challenging reinforcement learning problems with large state and/or action spaces. However, little is known about even their most basic theoretical convergence properties, including: if and how fast they converge to a globally optimal solution (say, with a sufficiently rich policy class); how they cope with approximation error due to using a restricted class of parametric policies; or their finite sample behavior. Such characterizations are important not only to compare these methods to their approximate value function counterparts (where such issues are relatively well understood, at least in the worst case), but also to help with more principled approaches to algorithm design. This work provides provable characterizations of computational, approximation, and sample size issues with regard to policy gradient methods in the context of discounted Markov Decision Processes (MDPs). We focus on both: 1) "tabular" policy parameterizations, where the optimal policy is contained in the class and where we show global convergence to the optimal policy, and 2) restricted policy classes, which may not contain the optimal policy and where we provide agnostic learning results. One insight of this work is in formalizing how a favorable initial state distribution provides a means to circumvent worst-case exploration issues. Overall, these results place policy gradient methods on a solid theoretical footing, analogous to the global convergence guarantees of iterative value-function-based algorithms.
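
As a minimal, illustrative sketch (not the algorithm analyzed in the paper), the snippet below runs exact policy gradient ascent with a tabular softmax parameterization on a toy two-state, two-action discounted MDP. The transition kernel P, reward R, discount gamma, initial state distribution rho, step size eta, and iteration count are all assumed for illustration; the gradient expression is the standard policy gradient theorem specialized to the softmax tabular class.

```python
# Minimal sketch: exact policy gradient ascent with a tabular softmax policy
# on a toy 2-state, 2-action discounted MDP. All numerical values are illustrative.
import numpy as np

n_states, n_actions, gamma = 2, 2, 0.9

# P[s, a, s'] = transition probability; R[s, a] = expected reward (toy values).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.8, 0.2]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
rho = np.array([0.5, 0.5])  # initial state distribution (assumed)

def softmax_policy(theta):
    """pi[s, a] from tabular logits theta of shape (n_states, n_actions)."""
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def value_functions(pi):
    """Exact V^pi and Q^pi by solving the Bellman equation for the discounted MDP."""
    P_pi = np.einsum("sa,sat->st", pi, P)   # state-to-state kernel under pi
    r_pi = np.einsum("sa,sa->s", pi, R)     # expected one-step reward under pi
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    Q = R + gamma * np.einsum("sat,t->sa", P, V)
    return V, Q

def discounted_state_visitation(pi):
    """d_rho^pi(s) = (1 - gamma) * sum_t gamma^t Pr(s_t = s), in closed form."""
    P_pi = np.einsum("sa,sat->st", pi, P)
    return (1 - gamma) * np.linalg.solve(np.eye(n_states) - gamma * P_pi.T, rho)

theta = np.zeros((n_states, n_actions))
eta = 1.0  # step size (assumed, not prescribed by the paper)
for _ in range(2000):
    pi = softmax_policy(theta)
    V, Q = value_functions(pi)
    d = discounted_state_visitation(pi)
    A = Q - V[:, None]  # advantage function A^pi(s, a)
    # Policy gradient theorem for the softmax tabular class:
    # dV^pi(rho)/dtheta[s, a] = (1 / (1 - gamma)) * d_rho^pi(s) * pi(a|s) * A^pi(s, a)
    grad = d[:, None] * pi * A / (1 - gamma)
    theta += eta * grad

print("final policy:\n", softmax_policy(theta).round(3))
print("V(rho):", rho @ value_functions(softmax_policy(theta))[0])
```

In this tabular setting the optimal policy lies in the parameterized class and exact gradients are available, which is the regime where the paper's global convergence results apply; with a restricted policy class or sampled gradients, only the agnostic-style guarantees discussed in the abstract would be expected.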
