Fast Multi-Agent Temporal-Difference Learning via Homotopy Stochastic Primal-Dual Optimization

We consider a distributed multi-agent policy evaluation problem in reinforcement learning. In our setup, a group of agents with jointly observed states and private local actions and rewards collaborates to learn the value function of a given policy. When the dimension of the state-action space is large, temporal-difference learning with linear function approximation is widely used. Under the assumption that the samples are i.i.d., the best known convergence rate for multi-agent temporal-difference learning is $O(1/\sqrt{T})$, measured by the mean squared projected Bellman error. In this paper, we formulate temporal-difference learning as a distributed stochastic saddle point problem and propose a new homotopy primal-dual algorithm that adaptively restarts the gradient updates from the average of previous iterates. We prove that our algorithm enjoys an $O(1/T)$ convergence rate up to logarithmic factors in $T$, thereby significantly improving the previously known convergence results on multi-agent temporal-difference learning. Furthermore, since our result explicitly takes into account the Markovian nature of sampling in policy evaluation, it addresses a broader class of problems than the commonly used i.i.d. sampling scenario. From a stochastic optimization perspective, to the best of our knowledge, the proposed homotopy primal-dual algorithm is the first to achieve an $O(1/T)$ convergence rate for distributed stochastic saddle point problems.
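To make the abstract's algorithmic idea concrete, below is a minimal single-agent Python sketch of a homotopy stochastic primal-dual update for the saddle-point formulation of TD policy evaluation with linear function approximation. It is an illustrative sketch only, not the paper's algorithm: the function names (`homotopy_primal_dual_td`, `sample`), the stage length, the step-size halving rule, and the single-agent simplification (no consensus step over a communication graph) are all assumptions introduced here for exposition.

```python
import numpy as np

# Sketch of a homotopy stochastic primal-dual method for the saddle-point form of
# TD policy evaluation with linear function approximation:
#     min_theta max_w  <b - A theta, w> - 0.5 * w^T C w,
# where A = E[phi(s) (phi(s) - gamma*phi(s'))^T], b = E[r * phi(s)], C = E[phi phi^T].
# Stage lengths and the step-size schedule below are illustrative assumptions.

def homotopy_primal_dual_td(sample, dim, gamma=0.95, stages=5, stage_len=1000, eta0=0.5):
    """sample() -> (phi, r, phi_next): features, reward, and next-state features of one transition."""
    theta = np.zeros(dim)   # primal variable (value-function weights)
    w = np.zeros(dim)       # dual variable
    eta = eta0
    for _ in range(stages):
        theta_sum = np.zeros(dim)
        w_sum = np.zeros(dim)
        for _ in range(stage_len):
            phi, r, phi_next = sample()
            A_t = np.outer(phi, phi - gamma * phi_next)   # stochastic estimate of A
            b_t = r * phi                                  # stochastic estimate of b
            C_t = np.outer(phi, phi)                       # stochastic estimate of C
            # Primal descent / dual ascent step on the stochastic Lagrangian.
            theta = theta + eta * (A_t.T @ w)
            w = w + eta * (b_t - A_t @ theta - C_t @ w)
            theta_sum += theta
            w_sum += w
        # Homotopy restart: continue from the averaged iterates with a smaller step size.
        theta, w = theta_sum / stage_len, w_sum / stage_len
        eta *= 0.5
    return theta
```

In the multi-agent setting described in the abstract, each agent would additionally average (mix) its primal-dual iterates with its neighbors over the communication network at every step, which this single-agent sketch omits.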
