Exploration-exploitation trade-off for continuous-time episodic reinforcement learning with linear-convex models

Abstract. We develop a probabilistic framework for analysing model-based reinforcement learning in the episodic setting. We then apply it to study finite-time horizon stochastic control problems with linear dynamics but unknown coefficients and a convex, but possibly irregular, objective function. Using probabilistic representations, we study the regularity of the associated cost functions and establish precise estimates for the performance gap between applying the optimal feedback controls derived from estimated and from true model parameters. We identify conditions under which this performance gap is quadratic, improving on the linear performance gap in recent work [X. Guo, A. Hu, and Y. Zhang, arXiv preprint arXiv:2104.09311, 2021] and matching the results obtained for stochastic linear-quadratic problems. Next, we propose a phase-based learning algorithm for which we show how to optimise the exploration-exploitation trade-off and achieve sublinear regret in high probability and in expectation. When the assumptions needed for the quadratic performance gap hold, the algorithm achieves a regret of order O(√(N ln N)) with high probability in the general case, and an expected regret of order O((ln N)^2) in the self-exploration case, over N episodes, matching the best possible results from the literature. The analysis requires novel concentration inequalities for correlated continuous-time observations, which we derive.
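To give a concrete feel for the kind of phase-based, certainty-equivalence scheme the abstract describes, the following minimal sketch runs an episodic learner on a scalar linear-quadratic toy problem: each phase spends a small number of exploration episodes with randomised controls, re-estimates the unknown drift coefficients by regularised least squares from the observed increments, and then exploits the certainty-equivalent feedback for the rest of the phase. This is an illustration only, not the paper's algorithm; the model parameters, the phase schedule, and the Monte Carlo regret proxy are assumptions made for this example.

```python
# Illustrative sketch only: phase-based certainty-equivalence learning for a
# scalar LQ problem. All numerical choices below are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

# Dynamics dX = (theta*X + b*u) dt + sigma dW; theta and b are unknown to the learner.
theta_true, b_true, sigma = -1.0, 1.0, 0.5
q, r, g, T, n_steps = 1.0, 1.0, 1.0, 1.0, 100
dt = T / n_steps

def riccati_gain(theta, b):
    """Euler solve of the Riccati ODE P' = -q - 2*theta*P + (b**2/r)*P**2
    backwards from P(T) = g; return the feedback gain K(t) = -b*P(t)/r."""
    P = np.empty(n_steps + 1)
    P[-1] = g
    for k in range(n_steps, 0, -1):
        dP = -q - 2.0 * theta * P[k] + (b ** 2 / r) * P[k] ** 2
        P[k - 1] = P[k] - dt * dP
    return -b * P / r

def run_episode(K, excite=0.0):
    """Simulate one episode (Euler-Maruyama) with u_t = K(t) X_t + excitation noise.
    Returns the realised cost and the regression data (features, state increments)."""
    x, cost, feats, incs = 1.0, 0.0, [], []
    for k in range(n_steps):
        u = K[k] * x + excite * rng.standard_normal()
        dW = np.sqrt(dt) * rng.standard_normal()
        dx = (theta_true * x + b_true * u) * dt + sigma * dW
        cost += (q * x ** 2 + r * u ** 2) * dt
        feats.append([x * dt, u * dt])
        incs.append(dx)
        x += dx
    return cost + g * x ** 2, np.array(feats), np.array(incs)

# Phase-based learning: in each phase, use roughly sqrt(#exploitation episodes)
# exploration episodes, then exploit the certainty-equivalent feedback.
est = np.array([0.0, 0.5])                 # initial guess for (theta, b)
A = 1e-3 * np.eye(2)                       # regularised normal equations A @ beta = y
y = np.zeros(2)
K_opt = riccati_gain(theta_true, b_true)   # oracle gain, used only to measure regret
total_regret = 0.0

for phase in range(1, 6):
    n_exploit = 2 ** phase
    n_explore = max(1, int(np.sqrt(n_exploit)))
    K_hat = riccati_gain(*est)
    for i in range(n_explore + n_exploit):
        excite = 1.0 if i < n_explore else 0.0
        cost, F, dX = run_episode(K_hat, excite)
        cost_opt, _, _ = run_episode(K_opt)    # independent optimal roll-out: crude regret proxy
        total_regret += cost - cost_opt
        A += F.T @ F                           # accumulate least-squares statistics
        y += F.T @ dX
    est = np.linalg.solve(A, y)                # re-estimate (theta, b) at the end of the phase
    print(f"phase {phase}: estimate {est.round(3)}, cumulative regret {total_regret:.2f}")
```

The square-root split between exploration and exploitation episodes within each phase is one simple way to trade off estimation error against the cost of exploring, loosely mirroring the trade-off behind the O(√(N ln N)) regret order discussed in the abstract.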

[1]  Benjamin Recht,et al.  Certainty Equivalent Control of LQR is Efficient , 2019, ArXiv.

[2]  Samuel N. Cohen,et al.  Parameter Uncertainty in the Kalman-Bucy Filter , 2017, SIAM J. Control. Optim..

[3]  A. Shiryayev,et al.  Statistics of Random Processes I: General Theory , 1984 .

[4]  S. Peng,et al.  Backward Stochastic Differential Equations in Finance , 1997 .

[5]  Rémi Munos,et al.  Reinforcement Learning for Continuous Stochastic Control Problems , 1997, NIPS.

[6]  John N. Tsitsiklis,et al.  Linearly Parameterized Bandits , 2008, Math. Oper. Res..

[7]  Rémi Munos,et al.  Policy Gradient in Continuous Time , 2006, J. Mach. Learn. Res..

[8]  Xin Guo,et al.  Logarithmic Regret for Episodic Continuous-Time Linear-Quadratic Reinforcement Learning Over a Finite-Time Horizon , 2020, J. Mach. Learn. Res..

[9]  Eduardo F. Morales,et al.  An Introduction to Reinforcement Learning , 2011 .

[10]  Lei Guo,et al.  Adaptive continuous-time linear quadratic Gaussian control , 1999, IEEE Trans. Autom. Control..

[11]  Samuel N. Cohen,et al.  Pathwise stochastic control with applications to robust filtering , 2019, The Annals of Applied Probability.

[12]  Assaf J. Zeevi,et al.  On Incomplete Learning and Certainty-Equivalence Control , 2017, Oper. Res..

[13]  N. Krylov,et al.  Introduction to the Theory of Random Processes , 2002 .

[14]  A. S. Kechris  Classical Descriptive Set Theory , 1995, Springer.

[15]  Cynthia Rudin,et al.  Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead , 2018, Nature Machine Intelligence.

[16]  L. Szpruch,et al.  Gradient Flows for Regularized Stochastic Control Problems , 2020, ArXiv.

[17]  S. Peng,et al.  Backward stochastic differential equations and quasilinear parabolic partial differential equations , 1992 .

[18]  Xun Yu Zhou,et al.  Reinforcement Learning in Continuous Time and Space: A Stochastic Control Approach , 2020, J. Mach. Learn. Res..

[19]  D. Crisan,et al.  Fundamentals of Stochastic Filtering , 2008 .

[20]  Benjamin Van Roy,et al.  Model-based Reinforcement Learning and the Eluder Dimension , 2014, NIPS.

[21]  Rémi Munos,et al.  A Study of Reinforcement Learning in the Continuous Case by the Means of Viscosity Solutions , 2000, Machine Learning.

[22]  John N. Tsitsiklis,et al.  Neuro-dynamic programming: an overview , 1995, Proceedings of 1995 34th IEEE Conference on Decision and Control.

[23]  Quanquan Gu,et al.  Logarithmic Regret for Reinforcement Learning with Linear Function Approximation , 2020, ICML.

[24]  A. Bensoussan Stochastic Control of Partially Observable Systems , 1992 .

[26]  Christoph Reisinger,et al.  Regularity and stability of feedback relaxed controls , 2021, SIAM Journal on Control and Optimization.

[27]  Max Simchowitz,et al.  Naive Exploration is Optimal for Online LQR , 2020, ICML.

[28]  R. J. Elliott  Stochastic Calculus and Applications , 1982, Springer.

[29]  Nikolai Matni,et al.  On the Sample Complexity of the Linear Quadratic Regulator , 2017, Foundations of Computational Mathematics.

[30]  R. Vershynin  High-Dimensional Probability: An Introduction with Applications in Data Science , 2018, Cambridge University Press.

[31]  Arnaud Lionnet,et al.  Time discretization of FBSDE with polynomial growth drivers and reaction-diffusion PDEs , 2013, 1309.2865.

[32]  Alessandro Lazaric,et al.  Improved Regret Bounds for Thompson Sampling in Linear Quadratic Control Problems , 2018, ICML.

[33]  Martin J. Wainwright  High-Dimensional Statistics: A Non-Asymptotic Viewpoint , 2019, Cambridge University Press.

[34]  Benjamin Van Roy,et al.  (More) Efficient Reinforcement Learning via Posterior Sampling , 2013, NIPS.

[35]  Xin Guo,et al.  Reinforcement learning for linear-convex models with jumps via stability analysis of feedback controls , 2021, ArXiv.

[36]  Xin Guo,et al.  Entropy Regularization for Mean Field Games with Learning , 2020, Math. Oper. Res..

[37]  A. Guillin,et al.  Transportation cost-information inequalities and applications to random dynamical systems and diffusions , 2004, math/0410172.

[38]  Backward Stochastic Differential Equations , Chapter 7, 2011.

[39]  Samuel N. Cohen,et al.  Asymptotic Randomised Control with applications to bandits , 2020, arXiv:2010.07252.

[40]  W. Fleming,et al.  Controlled Markov processes and viscosity solutions , 1992 .

[41]  V. Borkar Controlled diffusion processes , 2005, math/0511077.

[42]  Peter Auer,et al.  Near-optimal Regret Bounds for Reinforcement Learning , 2008, J. Mach. Learn. Res..

[43]  X. Zhou,et al.  Stochastic Controls: Hamiltonian Systems and HJB Equations , 1999 .

[44]  M. H. A. Davis  Linear Estimation and Stochastic Control , 1977, Chapman and Hall.