Reinforcement Learning Policies in Continuous-Time Linear Systems

Linear dynamical systems governed by stochastic differential equations are canonical models in continuous-time control. While optimal control of known systems has a rich literature, the problem becomes technically hard under model uncertainty, and results are scarce. We initiate the study of this problem, aiming to learn (and simultaneously deploy) optimal actions that minimize a quadratic cost function. To the best of our knowledge, this work is the first to comprehensively address the crucial challenge of balancing exploration against exploitation in continuous-time systems. We present online policies that learn optimal actions quickly by carefully randomizing the parameter estimates, and we establish their performance guarantees: a regret bound that grows as the square root of time multiplied by the number of parameters. An implementation of the policy on a flight-control task demonstrates its efficacy. Further, we prove sharp stability results for systems with inexact dynamics and tightly characterize the infinitesimal regret caused by sub-optimal actions. To obtain these results, we conduct a novel eigenvalue-sensitivity analysis of matrix perturbations, establish upper bounds on ratios of stochastic integrals, and introduce the new method of policy differentiation. Our analysis sheds light on fundamental challenges in continuous-time reinforcement learning and provides a useful foundation for similar problems.
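To make the approach concrete, below is a minimal Python sketch, not the paper's exact algorithm, of a randomized certainty-equivalence policy for a continuous-time linear-quadratic system: the learner estimates the drift parameters by least squares, perturbs the estimate with noise that shrinks as data accumulate, solves the continuous algebraic Riccati equation for the perturbed model, and applies the resulting linear feedback. The step size `dt`, the update schedule, and the perturbation scale `1/sqrt(k+1)` are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of a randomized certainty-equivalence policy for the
# continuous-time LQ problem dx = (A x + B u) dt + dW, simulated with
# Euler-Maruyama. Hyperparameters below are illustrative assumptions.
import numpy as np
from scipy.linalg import solve_continuous_are

rng = np.random.default_rng(0)
dt, T = 0.01, 50.0                                # step size, horizon
A_true = np.array([[0.0, 1.0], [-1.0, -0.5]])     # unknown to the learner
B_true = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)                       # quadratic cost x'Qx + u'Ru

def gain(A_hat, B_hat):
    """Feedback gain K = R^{-1} B' P from the continuous Riccati equation."""
    P = solve_continuous_are(A_hat, B_hat, Q, R)
    return np.linalg.solve(R, B_hat.T @ P)

x = np.zeros(2)
# Online least-squares statistics for theta = [A B]' with regressor z = [x; u]:
V = np.eye(3)            # regularized Gram matrix, integral of z z' dt
S = np.zeros((3, 2))     # integral of z dx'
K = np.zeros((1, 2))     # initial gain; this particular A_true is already stable

for k in range(int(T / dt)):
    if k > 0 and k % 500 == 0:                    # sparse re-estimation schedule
        theta_hat = np.linalg.solve(V, S)         # least-squares estimate
        # Randomize the estimate; the shrinking perturbation is the
        # exploration mechanism described in the abstract.
        theta_rand = theta_hat + rng.normal(scale=1.0 / np.sqrt(k + 1),
                                            size=theta_hat.shape)
        A_hat, B_hat = theta_rand[:2].T, theta_rand[2:].T
        try:
            K = gain(A_hat, B_hat)
        except (ValueError, np.linalg.LinAlgError):
            pass                                  # keep old gain if CARE fails
    u = -K @ x
    z = np.concatenate([x, u])
    dx = (A_true @ x + B_true @ u) * dt + rng.normal(scale=np.sqrt(dt), size=2)
    V += np.outer(z, z) * dt
    S += np.outer(z, dx)
    x = x + dx

print("final state:", x, "\nfinal gain:", K)
```

The randomization step is what drives exploration here: a plain certainty-equivalent controller can lock onto a wrong model, whereas perturbing the estimates at a carefully chosen rate is the mechanism the abstract credits for the square-root-of-time regret.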
