Distributed Dynamic Programming and an O.D.E. Framework of Distributed TD-Learning for Networked Multi-Agent Markov Decision Processes

The primary objective of this paper is to investigate distributed dynamic programming (DP) and distributed temporal difference (TD) learning algorithms for networked multi-agent Markov decision processes (MAMDPs). We adopt a distributed multi-agent framework in which each agent has access only to its own reward and cannot observe the rewards of other agents. In addition, each agent can share its parameters with neighboring agents over a communication network represented by a graph. Our contributions are twofold: 1) We introduce a novel distributed DP, inspired by the averaging consensus method in the continuous-time domain, and establish its convergence from a control-theoretic perspective. 2) Building on this DP, we devise a new distributed TD-learning algorithm and prove its convergence. A distinctive feature of the proposed distributed DP is that it comprises two independent dynamical systems, each with a distinct role. This structure enables a novel distributed TD-learning strategy whose convergence can be established directly via the Borkar-Meyn theorem.
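To make the networked setting concrete, below is a minimal sketch of a consensus-based distributed TD(0) update with linear function approximation, in which each agent sees only its own private reward and exchanges parameters with its graph neighbors. All names here (distributed_td0_step, the mixing matrix W, the feature vectors) are illustrative assumptions, not the paper's notation, and the sketch does not reproduce the paper's two coupled dynamic systems.

import numpy as np

def distributed_td0_step(thetas, W, phi, phi_next, rewards, gamma, alpha):
    """One synchronous round for all N agents.

    thetas   : (N, d) array, one parameter vector per agent
    W        : (N, N) doubly stochastic mixing matrix built from the graph
    phi      : (d,) feature vector of the current state
    phi_next : (d,) feature vector of the next state
    rewards  : (N,) local rewards, each private to its agent
    """
    # Consensus step: every agent averages its neighbors' parameters
    # according to the communication-graph weights.
    mixed = W @ thetas
    # Local step: each TD error uses only the agent's own private reward.
    td_errors = rewards + gamma * (thetas @ phi_next) - (thetas @ phi)
    return mixed + alpha * np.outer(td_errors, phi)

# Example: three agents with a small doubly stochastic mixing matrix.
W = np.array([[0.50, 0.25, 0.25],
              [0.25, 0.50, 0.25],
              [0.25, 0.25, 0.50]])
thetas = np.zeros((3, 4))
phi, phi_next = np.eye(4)[0], np.eye(4)[1]
rewards = np.array([1.0, 0.0, -1.0])  # private to agents 1, 2, 3
thetas = distributed_td0_step(thetas, W, phi, phi_next, rewards,
                              gamma=0.95, alpha=0.1)

Because W is doubly stochastic, the average of the agents' parameters evolves like a single-agent TD(0) update on the average reward, which is the intuition behind averaging-consensus schemes of this kind.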
