Fast Multi-Agent Temporal-Difference Learning via Homotopy Stochastic Primal-Dual Optimization

We consider a distributed multi-agent policy evaluation problem in reinforcement learning. In our setup, a group of agents with jointly observed states and private local actions and rewards collaborates to learn the value function of a given policy. When the dimension of the state-action space is large, temporal-difference learning with linear function approximation is widely used. Under the assumption that the samples are i.i.d., the best known convergence rate for multi-agent temporal-difference learning is $O(1/\sqrt{T})$, measured by the mean squared projected Bellman error. In this paper, we formulate temporal-difference learning as a distributed stochastic saddle point problem and propose a new homotopy primal-dual algorithm that adaptively restarts the gradient updates from the average of previous iterates. We prove that our algorithm enjoys an $O(1/T)$ convergence rate up to logarithmic factors in $T$, thereby significantly improving the previously known convergence results on multi-agent temporal-difference learning. Furthermore, since our result explicitly takes into account the Markovian nature of sampling in policy evaluation, it addresses a broader class of problems than the commonly used i.i.d. sampling scenario. From a stochastic optimization perspective, to the best of our knowledge, the proposed homotopy primal-dual algorithm is the first to achieve an $O(1/T)$ convergence rate for distributed stochastic saddle point problems.
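To make the abstract's algorithmic idea concrete, below is a minimal single-agent Python sketch of a homotopy stochastic primal-dual update for the saddle-point formulation of TD policy evaluation with linear function approximation. It is an illustrative sketch only, not the paper's algorithm: the function names (`homotopy_primal_dual_td`, `sample`), the stage length, the step-size halving rule, and the single-agent simplification (no consensus step over a communication graph) are all assumptions introduced here for exposition.

```python
import numpy as np

# Sketch of a homotopy stochastic primal-dual method for the saddle-point form of
# TD policy evaluation with linear function approximation:
#     min_theta max_w  <b - A theta, w> - 0.5 * w^T C w,
# where A = E[phi(s) (phi(s) - gamma*phi(s'))^T], b = E[r * phi(s)], C = E[phi phi^T].
# Stage lengths and the step-size schedule below are illustrative assumptions.

def homotopy_primal_dual_td(sample, dim, gamma=0.95, stages=5, stage_len=1000, eta0=0.5):
    """sample() -> (phi, r, phi_next): features, reward, and next-state features of one transition."""
    theta = np.zeros(dim)   # primal variable (value-function weights)
    w = np.zeros(dim)       # dual variable
    eta = eta0
    for _ in range(stages):
        theta_sum = np.zeros(dim)
        w_sum = np.zeros(dim)
        for _ in range(stage_len):
            phi, r, phi_next = sample()
            A_t = np.outer(phi, phi - gamma * phi_next)   # stochastic estimate of A
            b_t = r * phi                                  # stochastic estimate of b
            C_t = np.outer(phi, phi)                       # stochastic estimate of C
            # Primal descent / dual ascent step on the stochastic Lagrangian.
            theta = theta + eta * (A_t.T @ w)
            w = w + eta * (b_t - A_t @ theta - C_t @ w)
            theta_sum += theta
            w_sum += w
        # Homotopy restart: continue from the averaged iterates with a smaller step size.
        theta, w = theta_sum / stage_len, w_sum / stage_len
        eta *= 0.5
    return theta
```

In the multi-agent setting described in the abstract, each agent would additionally average (mix) its primal-dual iterates with its neighbors over the communication network at every step, which this single-agent sketch omits.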
