Learning robust control for LQR systems with multiplicative noise via policy gradient

The linear quadratic regulator (LQR) problem has reemerged as an important theoretical benchmark for reinforcement learning-based control of complex dynamical systems with continuous state and action spaces. In contrast with nearly all recent work in this area, we consider multiplicative noise models, which are increasingly relevant because they explicitly incorporate inherent uncertainty and variation in the system dynamics and thereby improve the robustness properties of the controller. Robustness is a critical and poorly understood issue in reinforcement learning; existing methods that do not account for uncertainty can converge to fragile policies or fail to converge at all. Moreover, intentional injection of multiplicative noise into learning algorithms can enhance the robustness of policies, as observed in ad hoc work on domain randomization. Although policy gradient algorithms require optimization of a non-convex cost function, we show that the multiplicative noise LQR cost has a special property called gradient domination, which we exploit to prove global convergence of policy gradient algorithms to the globally optimal control policy, with polynomial dependence on problem parameters. Results are provided in both the model-known and model-unknown settings; in the latter, samples of system trajectories are used to estimate the policy gradients.
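For concreteness, the multiplicative-noise LQR setup and the gradient domination property described above can be written in a standard form (the notation here is illustrative and may differ from the paper's own):

```latex
% State dynamics driven by i.i.d., zero-mean multiplicative noises \alpha_{ti}, \beta_{tj}
% entering through known direction matrices A_i, B_j, under a linear policy u_t = -K x_t:
x_{t+1} = \Big( A + \sum_{i=1}^{p} \alpha_{ti} A_i \Big) x_t
        + \Big( B + \sum_{j=1}^{q} \beta_{tj} B_j \Big) u_t ,
\qquad u_t = -K x_t ,
\qquad
C(K) = \mathbb{E}\!\left[ \sum_{t=0}^{\infty} x_t^\top Q x_t + u_t^\top R u_t \right].

% Gradient domination: for a mean-square stabilizing gain K, the suboptimality gap is
% bounded by the squared gradient norm, with a constant c(K) depending polynomially on
% problem parameters; this is what yields global convergence despite non-convexity:
C(K) - C(K^\ast) \;\le\; c(K)\, \lVert \nabla C(K) \rVert_F^2 .
```

In the model-unknown setting, the policy gradient must be estimated from sampled trajectories. Below is a minimal sketch of one common approach of this type: a two-point zeroth-order gradient estimate over random perturbations of the gain, paired with a finite-horizon Monte Carlo rollout cost. All names, signatures, and hyperparameters are hypothetical, not the paper's implementation:

```python
import numpy as np

def rollout_cost(K, A, B, As, a_vars, Bs, b_vars, Q, R, T=50, rng=None):
    """Monte Carlo finite-horizon cost of the policy u = -K x under multiplicative noise."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.standard_normal(A.shape[0])  # random initial state
    cost = 0.0
    for _ in range(T):
        u = -K @ x
        cost += x @ Q @ x + u @ R @ u
        # Sample this step's dynamics matrices from the multiplicative noise model.
        At = A + sum(rng.normal(0.0, np.sqrt(v)) * Ai for Ai, v in zip(As, a_vars))
        Bt = B + sum(rng.normal(0.0, np.sqrt(v)) * Bj for Bj, v in zip(Bs, b_vars))
        x = At @ x + Bt @ u
    return cost

def policy_gradient(K, cost_fn, step=1e-4, radius=1e-2, n_samples=100, iters=200, rng=None):
    """Gradient descent on C(K) using a two-point zeroth-order gradient estimate."""
    rng = np.random.default_rng() if rng is None else rng
    d = K.size
    for _ in range(iters):
        grad = np.zeros_like(K)
        for _ in range(n_samples):
            U = rng.standard_normal(K.shape)
            U *= radius / np.linalg.norm(U)        # perturb on the sphere of radius r
            grad += (cost_fn(K + U) - cost_fn(K - U)) * U
        grad *= d / (2.0 * radius**2 * n_samples)  # standard two-point estimator scaling
        K = K - step * grad
    return K

# Example usage on a scalar system with one noise direction (all values illustrative):
A, B = np.array([[0.9]]), np.array([[1.0]])
As, a_vars = [np.array([[1.0]])], [0.05]  # noise enters through the state matrix
Q, R = np.eye(1), np.eye(1)
cost = lambda K: np.mean([rollout_cost(K, A, B, As, a_vars, [], [], Q, R)
                          for _ in range(10)])
K = policy_gradient(np.zeros((1, 1)), cost)
```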
