Convergence and Sample Complexity of Gradient Methods for the Model-Free Linear–Quadratic Regulator Problem

Model-free reinforcement learning attempts to find an optimal control action for an unknown dynamical system by directly searching over the parameter space of controllers. The convergence behavior and statistical properties of these approaches are often poorly understood because of the nonconvex nature of the underlying optimization problems and the lack of exact gradient computation. In this paper, we take a step towards demystifying the performance and efficiency of such methods by focusing on the standard infinite-horizon linear quadratic regulator problem for continuous-time systems with unknown state-space parameters. We establish exponential stability for the ordinary differential equation (ODE) that governs the gradient-flow dynamics over the set of stabilizing feedback gains and show that a similar result holds for the gradient descent method that arises from the forward Euler discretization of the corresponding ODE. We also provide theoretical bounds on the convergence rate and sample complexity of the random search method with two-point gradient estimates. We prove that the required simulation time for achieving $\epsilon$-accuracy in the model-free setup and the total number of function evaluations both scale as $\log \, (1/\epsilon)$.
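To make the two-point gradient-estimation scheme concrete, the following Python sketch shows one plausible instantiation of the random search method for the continuous-time LQR cost; it is not the authors' implementation. The problem data (A, B, Q, R, Omega), the hyperparameters alpha, r, num_samples, and iters, and the function names lqr_cost and two_point_random_search are illustrative assumptions. For simplicity the cost is evaluated exactly via a Lyapunov equation, whereas the model-free setting analyzed in the paper relies on finite-time simulated trajectories.

```python
# Minimal sketch (assumptions noted above) of random search with two-point
# gradient estimates for the continuous-time LQR problem.
import numpy as np
from scipy import linalg


def lqr_cost(K, A, B, Q, R, Omega):
    """LQR cost f(K) = trace(X (Q + K' R K)), where X solves the Lyapunov
    equation (A - B K) X + X (A - B K)' + Omega = 0; +inf if K is not stabilizing."""
    Acl = A - B @ K
    if np.max(np.linalg.eigvals(Acl).real) >= 0:
        return np.inf                                  # closed loop unstable
    X = linalg.solve_continuous_lyapunov(Acl, -Omega)
    return float(np.trace(X @ (Q + K.T @ R @ K)))


def two_point_random_search(K, A, B, Q, R, Omega,
                            alpha=1e-4, r=1e-2, num_samples=20, iters=2000,
                            rng=None):
    """Update K <- K - alpha * g, where g averages two-point gradient
    estimates (f(K + rU) - f(K - rU)) / (2r) * U over random directions U."""
    rng = np.random.default_rng() if rng is None else rng
    m, n = K.shape
    for _ in range(iters):
        g = np.zeros_like(K)
        used = 0
        for _ in range(num_samples):
            U = rng.standard_normal((m, n))
            U *= np.sqrt(m * n) / np.linalg.norm(U)    # direction on a sphere
            df = lqr_cost(K + r * U, A, B, Q, R, Omega) \
                 - lqr_cost(K - r * U, A, B, Q, R, Omega)
            if np.isfinite(df):                        # skip destabilizing draws
                g += (df / (2.0 * r)) * U
                used += 1
        if used:
            K = K - alpha * g / used
    return K


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, m = 4, 2
    A = rng.standard_normal((n, n))
    A -= (np.max(np.linalg.eigvals(A).real) + 1.0) * np.eye(n)  # make A Hurwitz
    B = rng.standard_normal((n, m))
    Q, R, Omega = np.eye(n), np.eye(m), np.eye(n)
    K0 = np.zeros((m, n))                              # stabilizing since A is Hurwitz
    print("initial cost:", lqr_cost(K0, A, B, Q, R, Omega))
    K = two_point_random_search(K0, A, B, Q, R, Omega, rng=rng)
    print("final cost:  ", lqr_cost(K, A, B, Q, R, Omega))
```

The update uses only pairs of function evaluations, which is what drives the sample-complexity bounds discussed in the abstract; in this sketch the step size and smoothing radius are simply chosen small enough that the iterates remain in the set of stabilizing feedback gains.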
