Regret Bounds for the Adaptive Control of Linear Quadratic Systems

We study the average cost Linear Quadratic (LQ) control problem with unknown model parameters, also known as the adaptive control problem in the control community. We design an algorithm and prove that, apart from logarithmic factors, its regret up to time T is O(√T). Unlike previous approaches that use a forced-exploration scheme, we construct a high-probability confidence set around the model parameters and design an algorithm that plays optimistically with respect to this confidence set. The construction of the confidence set is based on recent results from online least-squares estimation and leads to an improved worst-case regret bound for the proposed algorithm. To the best of our knowledge, this is the first time that a regret bound has been derived for the LQ control problem.
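As a rough illustration (not spelled out in the abstract itself), confidence sets of this kind are typically ellipsoids built from regularized least-squares estimates via a self-normalized tail bound. Assuming state-action regressors z_s and an unknown parameter matrix Θ, one such form is

\[
\mathcal{C}_t(\delta) \;=\; \Bigl\{ \Theta : \operatorname{tr}\bigl((\hat{\Theta}_t - \Theta)^{\top} V_t\, (\hat{\Theta}_t - \Theta)\bigr) \le \beta_t(\delta) \Bigr\},
\qquad
V_t \;=\; \lambda I + \sum_{s < t} z_s z_s^{\top},
\]

where β_t(δ) grows only logarithmically in t. Playing optimistically then amounts to selecting, at each round,

\[
\tilde{\Theta}_t \;\in\; \operatorname*{arg\,min}_{\Theta \in \mathcal{C}_t(\delta)} J(\Theta),
\]

with J(Θ) the optimal average cost under model Θ, and applying the corresponding optimal linear feedback; the notation here is a hedged sketch of the general construction rather than the paper's exact statement.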
