A Two-Time-Scale Stochastic Optimization Framework with Applications in Control and Reinforcement Learning

We study a new two-time-scale stochastic gradient method for solving optimization problems in which the gradients are computed with the aid of an auxiliary variable under samples generated by time-varying Markov random processes parameterized by the underlying optimization variable. These time-varying samples make the gradient directions in our update biased and dependent, which can potentially lead to the divergence of the iterates. In our two-time-scale approach, one iterate estimates the true gradient from these samples, while the other uses this estimate to update the estimate of the optimal solution. The two iterates are implemented simultaneously, but the former is updated "faster" (using bigger step sizes) than the latter (using smaller step sizes). Our first contribution is to characterize the finite-time complexity of the proposed two-time-scale stochastic gradient method. In particular, we provide explicit formulas for its convergence rates under different structural assumptions, namely, strong convexity, convexity, the Polyak-Łojasiewicz (PŁ) condition, and general non-convexity. We apply our framework to two problems in control and reinforcement learning. First, we study the standard online actor-critic algorithm over finite state and action spaces and derive a convergence rate of O(k^{-2/5}), which recovers the best known rate derived specifically for this problem. Second, we study an online actor-critic algorithm for the linear-quadratic regulator and show that a convergence rate of O(k^{-2/3}) is achieved; this is the first time such a result is known in the literature. Finally, we support our theoretical analysis with numerical simulations in which the convergence rates are visualized. Our abstraction unifies the analysis of actor-critic methods in reinforcement learning, and we show how our main results reproduce the best-known convergence rates for the general policy optimization problem and how they can be used to derive a state-of-the-art rate for online linear-quadratic regulator (LQR) controllers.
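To make the two-time-scale structure concrete, a minimal schematic of the coupled updates is given below. This is only a sketch written in our own generic notation (x for the decision variable, y for the auxiliary gradient estimate, H for a sample-based operator), not the paper's exact recursion:

$$
y_{k+1} = y_k - \beta_k\, H(x_k, y_k; \xi_{k+1}), \qquad
x_{k+1} = x_k - \alpha_k\, y_{k+1},
$$

where \xi_{k+1} is a sample from the Markov process parameterized by x_k, and H is a (possibly biased) sample-based operator whose fixed point for frozen x recovers the true gradient \nabla f(x). Choosing the fast step size \beta_k much larger than the slow step size \alpha_k (e.g., \alpha_k / \beta_k \to 0) lets y_k track \nabla f(x_k) while x_k drifts slowly toward the optimum.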
