Linear Quadratic Reinforcement Learning: Sublinear Regret in the Episodic Continuous-Time Framework

This paper studies a continuous-time linear quadratic reinforcement learning problem in an episodic setting. We first show that naive discretization, i.e., applying discrete-time RL algorithms to a piecewise approximation of the continuous-time problem, incurs regret that grows linearly in the number of learning episodes $N$. We then propose an algorithm with continuous-time controls based on regularized least-squares estimation and establish a sublinear regret bound of order $\tilde O(N^{9/10})$. The analysis has two parts: a parameter estimation error bound, which relies on properties of sub-exponential random variables and double stochastic integrals; and a perturbation analysis, which establishes the robustness of the associated continuous-time Riccati equation by exploiting its regularity. In the one-dimensional case, the regret bound improves to $\tilde O(\sqrt{N})$.
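To make the regularized least-squares step concrete, here is a minimal sketch, not the paper's exact algorithm: for linear dynamics $dX_t = (A X_t + B u_t)\,dt + dW_t$ observed on a grid of step size $dt$, the continuous-time regularized least-squares estimator $\hat\theta = (\int Z_t Z_t^\top dt + \lambda I)^{-1} \int Z_t\, dX_t^\top$ with regressors $Z_t = (X_t, u_t)$ can be approximated from sampled data. All names below (`estimate_dynamics`, the ridge weight `lam`) are illustrative assumptions, not notation from the paper.

```python
import numpy as np

def estimate_dynamics(states, controls, dt, lam=1.0):
    """Ridge-regularized least-squares estimate of theta = [A B].

    states:   array of shape (T+1, d) -- sampled state trajectory
    controls: array of shape (T, m)   -- piecewise-constant controls
    dt:       float                   -- sampling step size
    lam:      float                   -- ridge regularization weight
    Returns theta_hat of shape (d, d + m).
    """
    dX = states[1:] - states[:-1]              # increments dX_t, shape (T, d)
    Z = np.hstack([states[:-1], controls])     # regressors Z_t, shape (T, d+m)
    # Discretized regularized normal equations:
    #   (sum Z_t Z_t^T dt + lam I) theta^T = sum Z_t dX_t^T
    G = Z.T @ Z * dt + lam * np.eye(Z.shape[1])
    theta_hat = np.linalg.solve(G, Z.T @ dX).T
    return theta_hat

# Usage on synthetic one-dimensional data:
rng = np.random.default_rng(0)
A_true, B_true, dt, T = -0.5, 1.0, 0.01, 1000
x = np.zeros((T + 1, 1))
u = rng.normal(size=(T, 1))                    # exploratory controls
for t in range(T):
    drift = A_true * x[t] + B_true * u[t]
    x[t + 1] = x[t] + drift * dt + np.sqrt(dt) * rng.normal(size=1)

print(estimate_dynamics(x, u, dt))             # roughly [[-0.5, 1.0]]
```

The ridge term $\lambda I$ keeps the Gram matrix invertible even before the trajectory data is rich enough to excite all directions, which is what makes the estimator well defined in early episodes.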
