Posterior Sampling-Based Reinforcement Learning for Control of Unknown Linear Systems

We propose a posterior sampling-based learning algorithm for the linear quadratic (LQ) control problem with unknown system parameters. The algorithm, called posterior sampling-based reinforcement learning for LQ regulator (PSRL-LQ), uses two stopping criteria to determine the lengths of its dynamic episodes. The first stopping criterion controls the growth rate of the episode length. The second stopping criterion is triggered when the determinant of the sample covariance matrix is less than half of the previous value. We show that, under some conditions on the prior distribution, the expected (Bayesian) regret of PSRL-LQ accumulated up to time $T$ is bounded by $\tilde{O}(\sqrt{T})$, where $\tilde{O}(\cdot)$ hides constants and logarithmic factors. Numerical simulations illustrate the performance of PSRL-LQ.
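The two stopping criteria can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on several assumptions not stated in the abstract: an untruncated Gaussian prior with zero mean and identity covariance, unit-variance process noise, the "previous value" of the determinant read as its value at the start of the current episode, and an episode allowed to run at most one step longer than the one before it. The names `episode_should_end`, `lq_gain`, and `simulate_psrl_lq` are hypothetical.

```python
import numpy as np

def episode_should_end(t, t_k, prev_len, logdet_cov_now, logdet_cov_start):
    """Return True if either stopping criterion fires at time t."""
    # Criterion 1 (assumed form): the episode has outlasted the previous one.
    too_long = t > t_k + prev_len
    # Criterion 2: det(cov) fell below half of its value at the episode start,
    # compared in log space to avoid underflow.
    halved = logdet_cov_now < logdet_cov_start + np.log(0.5)
    return bool(too_long or halved)

def lq_gain(A, B, Q, R, iters=500):
    """Approximate the LQ gain K (u = K x) by iterating the Riccati recursion."""
    P = Q.copy()
    for _ in range(iters):
        G = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ G)
    return -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

def simulate_psrl_lq(A, B, Q, R, horizon, seed=0):
    """Run the sketch on the true system (A, B); return the episode start times."""
    rng = np.random.default_rng(seed)
    n, m = B.shape
    d = n + m
    precision = np.eye(d)      # inverse posterior covariance (assumed prior: I)
    S = np.zeros((d, n))       # running sum of z_t x_{t+1}^T
    x = np.zeros(n)
    starts, t_k, prev_len, logdet_start, K = [], 0, 0, 0.0, None
    for t in range(horizon):
        # log det of the covariance equals minus log det of the precision
        logdet_cov = -np.linalg.slogdet(precision)[1]
        if K is None or episode_should_end(t, t_k, prev_len, logdet_cov, logdet_start):
            if K is not None:
                prev_len = t - t_k
            t_k, logdet_start = t, logdet_cov
            starts.append(t)
            # Sample theta = [A, B]^T from the Gaussian posterior.
            mean = np.linalg.solve(precision, S)
            L = np.linalg.cholesky(precision)
            theta = mean + np.linalg.solve(L.T, rng.standard_normal((d, n)))
            # Act optimally for the sampled system until the episode ends.
            K = lq_gain(theta[:n].T, theta[n:].T, Q, R)
        u = K @ x
        z = np.concatenate([x, u])
        x_next = A @ x + B @ u + rng.standard_normal(n)
        precision += np.outer(z, z)    # Bayesian linear-regression update
        S += np.outer(z, x_next)
        x = x_next
    return starts
```

Calling `simulate_psrl_lq(np.array([[0.9]]), np.array([[0.5]]), np.eye(1), np.eye(1), horizon=60)` returns the list of episode start times for a scalar system. Under these assumptions, each episode is at most one step longer than the previous one, and the determinant test can only shorten it.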
