Global Convergence of Policy Gradient Methods for Linearized Control Problems

Direct policy gradient methods for reinforcement learning and continuous control problems are a popular approach for a variety of reasons: 1) they are easy to implement without explicit knowledge of the underlying model; 2) they are an "end-to-end" approach, directly optimizing the performance metric of interest; and 3) they inherently allow for richly parameterized policies. A notable drawback is that even in the most basic continuous control problem, the linear quadratic regulator, these methods must solve a non-convex optimization problem, and little is understood about their efficiency from either a computational or a statistical perspective. In contrast, system identification and model-based planning in optimal control theory rest on a much more solid theoretical footing, with well-understood computational and statistical properties. This work bridges this gap by showing that (model-free) policy gradient methods globally converge to the optimal solution and are efficient (polynomially so in relevant problem-dependent quantities) with regard to their sample and computational complexities.
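
As a concrete illustration of the model-free setting, the sketch below runs zeroth-order (perturbation-based) policy gradient descent on a small discrete-time LQR instance: the cost of a linear policy u = -Kx is estimated from sampled rollouts, a gradient is estimated from random perturbations of K, and plain gradient descent is applied to K. This is a minimal sketch, not the paper's exact algorithm; the dynamics, cost matrices, horizon, perturbation radius, and step size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative discrete-time system x_{t+1} = A x_t + B u_t with quadratic cost
# (these matrices are assumptions for the sketch, not taken from the paper).
A = np.array([[0.9, 0.1],
              [0.0, 0.9]])
B = np.array([[0.0],
              [0.1]])
Q = np.eye(2)
R = 0.1 * np.eye(1)

def rollout_cost(K, horizon=100, n_rollouts=10):
    """Estimate the cost of the linear policy u = -K x from sampled rollouts."""
    total = 0.0
    for _ in range(n_rollouts):
        x = rng.normal(size=2)               # random initial state
        cost = 0.0
        for _ in range(horizon):
            u = -K @ x
            cost += x @ Q @ x + u @ R @ u
            x = A @ x + B @ u
        total += cost
    return total / n_rollouts

def zeroth_order_gradient(K, radius=0.05, n_samples=50):
    """Gradient estimate built only from cost evaluations at perturbed policies."""
    d = K.size
    grad = np.zeros_like(K)
    for _ in range(n_samples):
        U = rng.normal(size=K.shape)
        U *= radius / np.linalg.norm(U)      # perturbation of Frobenius norm `radius`
        grad += rollout_cost(K + U) * U
    return (d / (n_samples * radius ** 2)) * grad

# Plain gradient descent on the policy parameters K (K = 0 is stabilizing here,
# since A has spectral radius below one).
K = np.zeros((1, 2))
for i in range(200):
    K -= 1e-4 * zeroth_order_gradient(K)
    if i % 50 == 0:
        print(f"iter {i:3d}  estimated cost {rollout_cost(K):.3f}")
```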
