Online Reinforcement Learning in Stochastic Continuous-Time Systems

Linear dynamical systems governed by stochastic differential equations are canonical models. While optimal control of known systems has a rich literature, the problem is technically hard under model uncertainty, and there are hardly any such results. We initiate the study of this problem and aim to learn (and simultaneously deploy) optimal actions for minimizing a quadratic cost function. Indeed, this work is the first to comprehensively address the crucial challenge of balancing exploration against exploitation in continuous-time systems. We present online policies that learn optimal actions quickly by carefully randomizing the parameter estimates, and we establish their performance guarantees: a regret bound that grows as the square root of time multiplied by the number of parameters. An implementation of the policy for a flight-control task demonstrates its efficacy. Further, we prove sharp stability results for inexact system dynamics and tightly specify the infinitesimal regret caused by sub-optimal actions. To obtain these results, we conduct a novel eigenvalue-sensitivity analysis for matrix perturbation, establish upper bounds for comparative ratios of stochastic integrals, and introduce the new method of policy differentiation. Our analysis sheds light on fundamental challenges in continuous-time reinforcement learning and suggests a useful cornerstone for similar problems.
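To make the setting concrete, the following is a minimal sketch (not the paper's algorithm) of the scheme the abstract describes, for a scalar diffusion dx = (a·x + b·u)dt + dw with quadratic cost q·x² + r·u²: the policy forms least-squares estimates of the unknown (a, b), randomizes them with Gaussian perturbations whose scale decays over time, and applies the certainty-equivalent LQR gain computed from the perturbed estimates. All parameter values, the decay rate 1/√t, and the function names are illustrative assumptions.

```python
import math
import random


def lqr_gain(a, b, q, r):
    """Closed-form positive root of the scalar continuous-time algebraic
    Riccati equation 2*a*p - (b**2/r)*p**2 + q = 0; the optimal feedback
    is u = -k*x with k = b*p/r."""
    p = (a * r + math.sqrt(a * a * r * r + q * r * b * b)) / (b * b)
    return b * p / r


def run_episode(a=0.5, b=1.0, q=1.0, r=1.0, dt=0.01, horizon=2000, seed=0):
    """Euler-Maruyama simulation of dx = (a*x + b*u)dt + dw under a
    randomized certainty-equivalent policy (illustrative sketch only)."""
    rng = random.Random(seed)
    x, cost = 0.0, 0.0
    # Sufficient statistics for least squares of dx on (x*dt, u*dt),
    # with a small ridge on the diagonal for invertibility.
    sxx = suu = 1e-3
    sxu = sxy = suy = 0.0
    a_hat, b_hat = 0.0, 1.0  # crude initial guesses (assumption)
    for t in range(1, horizon + 1):
        # Randomize the estimates before computing the gain: this is the
        # "careful randomization" used for exploration, with decaying scale.
        scale = 1.0 / math.sqrt(t)
        a_tilde = a_hat + scale * rng.gauss(0.0, 1.0)
        b_tilde = b_hat + scale * rng.gauss(0.0, 1.0)
        if abs(b_tilde) < 0.1:
            b_tilde = math.copysign(0.1, b_tilde)  # keep the gain well defined
        k = lqr_gain(a_tilde, b_tilde, q, r)
        u = -k * x
        dw = rng.gauss(0.0, math.sqrt(dt))
        dx = (a * x + b * u) * dt + dw
        cost += (q * x * x + r * u * u) * dt
        # Update the 2x2 normal equations and refresh the estimates.
        sxx += x * x * dt
        sxu += x * u * dt
        suu += u * u * dt
        sxy += x * dx
        suy += u * dx
        det = sxx * suu - sxu * sxu
        a_hat = (suu * sxy - sxu * suy) / det
        b_hat = (sxx * suy - sxu * sxy) / det
        x += dx
    return cost / (horizon * dt), a_hat, b_hat
```

The decaying perturbation scale mirrors the exploration-exploitation trade-off in the regret bound: early randomization forces informative data, while its 1/√t decay lets the policy converge to the certainty-equivalent optimal gain.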
