Logarithmic Regret for Episodic Continuous-Time Linear-Quadratic Reinforcement Learning Over a Finite-Time Horizon

We study finite-time horizon, continuous-time linear-quadratic reinforcement learning problems in an episodic setting, where both the state and control coefficients are unknown to the controller. We first propose a least-squares algorithm based on continuous-time observations and controls, and establish a logarithmic regret bound of order $O((\ln M)(\ln\ln M))$, where $M$ is the number of learning episodes. The analysis has two components: a perturbation analysis, which exploits the regularity and robustness of the associated Riccati differential equation, and a parameter estimation error analysis, which relies on sub-exponential properties of continuous-time least-squares estimators. We further propose a practically implementable least-squares algorithm based on discrete-time observations and piecewise constant controls, which achieves a similar logarithmic regret with an additional term depending explicitly on the time stepsizes used in the algorithm.
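To make the discrete-time variant concrete, the following is a minimal sketch (not the paper's algorithm) of least-squares estimation of unknown drift coefficients from discrete-time observations under piecewise constant controls. The dynamics $dX_t = (AX_t + Bu_t)\,dt + dW_t$, the specific matrices, the Euler discretization, and the Gaussian exploratory controls are all illustrative assumptions: one regresses the scaled state increments on the stacked state-control regressor to recover $\theta = (A, B)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (unknown-to-the-learner) coefficients: dX = (A X + B u) dt + dW
A_true = np.array([[-0.5, 0.2], [0.0, -1.0]])
B_true = np.array([[1.0], [0.5]])

d, k = A_true.shape[0], B_true.shape[1]
dt, N = 0.01, 20000  # time stepsize and number of steps (assumed values)

# Simulate one trajectory under piecewise constant exploratory controls,
# constant on each subinterval [n*dt, (n+1)*dt)
X = np.zeros((N + 1, d))
U = rng.normal(size=(N, k))
for n in range(N):
    dW = rng.normal(scale=np.sqrt(dt), size=d)
    X[n + 1] = X[n] + (A_true @ X[n] + B_true @ U[n]) * dt + dW

# Least-squares estimate of theta = [A B]: regress the scaled increments
# (X_{n+1} - X_n)/dt on the stacked regressor z_n = (X_n, u_n)
Z = np.hstack([X[:-1], U])        # shape (N, d + k)
Y = (X[1:] - X[:-1]) / dt         # shape (N, d)
theta_hat, *_ = np.linalg.lstsq(Z, Y, rcond=None)
A_hat, B_hat = theta_hat.T[:, :d], theta_hat.T[:, d:]
```

As the abstract indicates, the discretization contributes an additional error term depending on the stepsize `dt`, so the estimate is only consistent as both the observation horizon grows and the stepsize shrinks.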
