Characterizing the Exact Behaviors of Temporal Difference Learning Algorithms Using Markov Jump Linear System Theory

In this paper, we provide a unified analysis of temporal difference learning algorithms with linear function approximators by exploiting their connections to Markov jump linear systems (MJLS). We tailor the MJLS theory developed in the control community to characterize the exact behaviors of the first- and second-order moments of a large family of temporal difference learning algorithms. For both the IID and Markov noise cases, we show that the evolution of augmented versions of the mean and covariance matrix of the TD estimation error exactly follows the trajectory of a deterministic linear time-invariant (LTI) dynamical system. Applying well-known LTI system theory, we obtain closed-form expressions for the mean and covariance matrix of the TD estimation error at every time step. We provide a tight matrix spectral radius condition that guarantees convergence of the covariance matrix of the TD estimation error, and we perform a perturbation analysis to characterize how the TD behaviors depend on the learning rate. For the IID case, we give an exact formula describing how the mean and covariance matrix of the TD estimation error converge to their steady-state values. For the Markov case, we use our formulas to explain how the behaviors of TD learning algorithms are affected by the learning rate and the underlying Markov chain. For both cases, we provide upper and lower bounds on the mean-square TD error and show that it converges linearly to an exact limit.
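As a concrete illustration of the IID case, the Python sketch below builds a small synthetic on-policy TD(0) problem, forms the mean update matrix Abar together with the Kronecker-product operator M that drives the homogeneous part of the second-moment recursion, and checks the two spectral radius conditions discussed above. It is a minimal sketch: all problem data (the transition matrix P, the feature matrix Phi, and the constants gamma and alpha) are hypothetical placeholders chosen for illustration, not quantities from the paper.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical problem data: a small MDP with linear features.
n_states, n_feats = 6, 3
gamma, alpha = 0.9, 0.05                        # discount factor, constant learning rate
P = rng.random((n_states, n_states))
P /= P.sum(axis=1, keepdims=True)               # row-stochastic transition matrix
Phi = rng.standard_normal((n_states, n_feats))  # row s is the feature vector phi(s)

# On-policy IID sampling: draw states from the stationary distribution of P.
evals, evecs = np.linalg.eig(P.T)
d = np.real(evecs[:, np.argmax(np.real(evals))])
d /= d.sum()

# TD(0) with linear function approximation drives the error e_k = theta_k - theta*
# via e_{k+1} = (I + alpha*A_k) e_k + alpha*w_k, where for a sampled transition
# (s, s') the random matrix is A(s, s') = phi(s) (gamma*phi(s') - phi(s))^T and
# w_k is a zero-mean forcing term (it shapes the steady-state covariance, not
# stability, so it is omitted below).
def A(s, sp):
    return np.outer(Phi[s], gamma * Phi[sp] - Phi[s])

# All transitions with their IID sampling probabilities d(s) * P(s, s').
pairs = [(s, sp, d[s] * P[s, sp]) for s in range(n_states) for sp in range(n_states)]
I = np.eye(n_feats)

# First moment: E[e_{k+1}] = (I + alpha*Abar) E[e_k], with Abar = E[A_k].
Abar = sum(p * A(s, sp) for s, sp, p in pairs)
rho_mean = np.max(np.abs(np.linalg.eigvals(I + alpha * Abar)))

# Second moment: the homogeneous part of the recursion for vec(E[e_k e_k^T]) is
# driven by M = E[(I + alpha*A_k) kron (I + alpha*A_k)].
M = sum(p * np.kron(I + alpha * A(s, sp), I + alpha * A(s, sp)) for s, sp, p in pairs)
rho_cov = np.max(np.abs(np.linalg.eigvals(M)))

print(f"rho(I + alpha*Abar) = {rho_mean:.4f}  (mean dynamics stable iff < 1)")
print(f"rho(M)              = {rho_cov:.4f}  (covariance dynamics stable iff < 1)")

Sweeping alpha in this sketch shows both radii dropping below one for sufficiently small alpha, with rho(M) typically crossing back above one first as alpha grows; that covariance-level instability is exactly what the matrix spectral radius condition rules out.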
