Global Optimality Guarantees For Policy Gradient Methods

Policy gradient methods are perhaps the most widely used class of reinforcement learning algorithms. These methods apply to complex, poorly understood control problems by performing stochastic gradient descent over a parameterized class of policies. Unfortunately, even for simple control problems solvable by classical techniques, policy gradient algorithms face non-convex optimization problems and are widely understood to converge only to local minima. This work identifies structural properties -- shared by finite MDPs and several classic control problems -- which guarantee that the policy gradient objective function has no suboptimal local minima despite being non-convex. When these assumptions are relaxed, our work gives conditions under which any local minimum is near-optimal, where the error bound depends on a notion of the expressive capacity of the policy class.
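
To make the setting concrete, the following is a minimal sketch (not code from the paper) of exact policy gradient ascent with a softmax tabular policy on a small, randomly generated finite MDP; the toy MDP, step size, and iteration count are illustrative assumptions. The objective is non-convex in the policy logits, yet with this fully expressive policy class gradient ascent approaches the optimal value, consistent with the structural guarantees described above.

# Minimal sketch: exact policy gradient ascent on a toy finite MDP (assumed setup,
# not the paper's experiments). Softmax tabular policy, discounted objective.
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] -> distribution over next states
R = rng.uniform(size=(n_states, n_actions))                        # reward r(s, a)
rho = np.ones(n_states) / n_states                                 # initial-state distribution

def softmax_policy(theta):
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)                        # pi[s, a]

def objective_and_grad(theta):
    pi = softmax_policy(theta)
    P_pi = np.einsum('sa,san->sn', pi, P)                          # state-transition matrix under pi
    r_pi = (pi * R).sum(axis=1)
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)     # V^pi
    Q = R + gamma * P @ V                                          # Q^pi[s, a]
    A = Q - (pi * Q).sum(axis=1, keepdims=True)                    # advantage A^pi[s, a]
    d = np.linalg.solve(np.eye(n_states) - gamma * P_pi.T, rho)    # discounted state occupancy (unnormalized)
    grad = d[:, None] * pi * A                                      # exact policy gradient w.r.t. softmax logits
    return rho @ V, grad

theta = np.zeros((n_states, n_actions))
for _ in range(2000):
    J, g = objective_and_grad(theta)
    theta += 1.0 * g                                               # gradient ascent step

# Compare against the optimal value computed by value iteration.
V_star = np.zeros(n_states)
for _ in range(1000):
    V_star = (R + gamma * P @ V_star).max(axis=1)
print(f"policy gradient J = {J:.4f}, optimal J = {rho @ V_star:.4f}")

Despite the non-convexity of J in theta, the printed values nearly coincide on this toy instance, illustrating the absence of suboptimal local minima in the fully expressive (tabular) case.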
