QD-Learning: A Collaborative Distributed Strategy for Multi-Agent Reinforcement Learning Through Consensus + Innovations

The paper develops $\mathcal{QD}$-learning, a distributed version of reinforcement $Q$-learning, for multi-agent Markov decision processes (MDPs); the agents have no prior information on the global state transition and on the local agent cost statistics. The network agents minimize a network-averaged infinite horizon discounted cost, by local processing and by collaborating through mutual information exchange over a sparse (possibly stochastic) communication network. The agents respond differently (depending on their instantaneous one-stage random costs) to a global controlled state and the control actions of a remote controller. When each agent is aware only of its local online cost data and the inter-agent communication network is weakly connected, we prove that $\mathcal{QD}$-learning, a consensus + innovations algorithm with mixed time-scale stochastic dynamics, converges asymptotically almost surely to the desired value function and to the optimal stationary control policy at each network agent.
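To make the consensus + innovations structure concrete, the following is a minimal sketch of one update step, under assumptions that the abstract does not state. It assumes each agent keeps a tabular Q estimate, that the consensus term is the sum of disagreements with current neighbors, and that the innovation term is the usual Q-learning temporal-difference error built from the agent's own one-stage cost. The names `qd_step` and `step_sizes`, the data layout, and the polynomial step-size constants are all hypothetical and chosen only for illustration; the exact update and weight conditions should be taken from the paper itself.

```python
from typing import Dict, List, Tuple

QTable = List[List[float]]  # QTable[state][action], one table per agent


def qd_step(
    q: List[QTable],
    neighbors: Dict[int, List[int]],
    state: int,
    action: int,
    next_state: int,
    costs: List[float],
    gamma: float,
    alpha: float,
    beta: float,
) -> List[QTable]:
    """Update the visited (state, action) entry at every agent (illustrative)."""
    # Copy the tables so every agent updates from the same old values.
    new_q = [[row[:] for row in table] for table in q]
    for n, table in enumerate(q):
        own = table[state][action]
        # Consensus: disagreement with the estimates of current neighbors.
        consensus = sum(own - q[l][state][action] for l in neighbors.get(n, []))
        # Innovation: temporal-difference term using only the local cost;
        # min over actions because the agents minimize a discounted cost.
        innovation = costs[n] + gamma * min(table[next_state]) - own
        new_q[n][state][action] = own - beta * consensus + alpha * innovation
    return new_q


def step_sizes(
    k: int, a: float = 0.1, b: float = 0.2, tau1: float = 1.0, tau2: float = 0.3
) -> Tuple[float, float]:
    """Hypothetical mixed time-scale weights: alpha decays faster than beta."""
    return a / (k + 1) ** tau1, b / (k + 1) ** tau2


if __name__ == "__main__":
    # Two agents, two states, two actions; the agents disagree on entry (0, 0).
    q0 = [[[0.0, 0.0], [0.0, 0.0]], [[2.0, 0.0], [0.0, 0.0]]]
    links = {0: [1], 1: [0]}
    q1 = qd_step(q0, links, state=0, action=0, next_state=1,
                 costs=[1.0, 3.0], gamma=0.5, alpha=0.5, beta=0.25)
    print(q1[0][0][0], q1[1][0][0])
```

In this sketch the innovation weight decays faster than the consensus weight, which is one way to read "mixed time-scale": agreement among neighbors is enforced on a faster time scale than the local learning from cost samples.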
