Multi-Agent Reinforcement Learning in Time-varying Networked Systems

We study multi-agent reinforcement learning (MARL) in a time-varying network of agents. The objective is to find localized policies that maximize the (discounted) global reward. In general, scalability is a challenge in this setting because the size of the global state/action space can be exponential in the number of agents. Scalable algorithms are known only when the dependencies are static and local, e.g., between neighbors in a fixed, time-invariant underlying graph. In this work, we propose a Scalable Actor Critic framework that applies in settings where the dependencies can be non-local and time-varying, and we provide a finite-time error bound showing how the convergence rate depends on the speed of information spread in the network. Additionally, as a byproduct of our analysis, we obtain novel finite-time convergence results for a general stochastic approximation scheme and for temporal difference learning with state aggregation, which apply beyond the setting of RL in networked systems.
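To make the last ingredient concrete, below is a minimal sketch of temporal difference learning with state aggregation, one of the building blocks mentioned in the abstract. It is an illustration under simplifying assumptions rather than the paper's algorithm: the aggregation map phi, the function td0_state_aggregation, and the toy random-walk chain are hypothetical names introduced here, and transitions are assumed to come from a fixed policy.

```python
import numpy as np

rng = np.random.default_rng(0)

def td0_state_aggregation(transitions, phi, num_clusters, gamma=0.95, alpha=0.1):
    """TD(0) on aggregated states: one value parameter per cluster.

    transitions -- iterable of (state, reward, next_state) tuples from a fixed policy
    phi         -- maps a raw state to a cluster index in [0, num_clusters)
    """
    theta = np.zeros(num_clusters)
    for s, r, s_next in transitions:
        i, j = phi(s), phi(s_next)
        td_error = r + gamma * theta[j] - theta[i]  # TD(0) error on cluster values
        theta[i] += alpha * td_error
    return theta

# Toy usage: a 100-state random walk aggregated into 10 clusters of 10 states each.
def phi(s):
    return s // 10

def sample_chain(steps=5000, start=50):
    s = start
    for _ in range(steps):
        s_next = min(max(s + rng.choice((-1, 1)), 0), 99)
        reward = 1.0 if s_next == 99 else 0.0
        yield s, reward, s_next
        s = s_next

cluster_values = td0_state_aggregation(sample_chain(), phi, num_clusters=10)
print(cluster_values)  # estimated value of each aggregated state
```

The point of aggregation is that the number of learned parameters equals the number of clusters rather than the number of raw states, which is what keeps value estimation tractable when the global state space grows exponentially with the number of agents.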
