Multi-Agent Reinforcement Learning in Time-varying Networked Systems

We study multi-agent reinforcement learning (MARL) in a time-varying network of agents. The objective is to find localized policies that maximize the (discounted) global reward. In general, scalability is a challenge in this setting because the size of the global state/action space can be exponential in the number of agents. Scalable algorithms are known only when the dependencies are static and local, e.g., between neighbors in a fixed, time-invariant underlying graph. In this work, we propose a Scalable Actor Critic framework that applies in settings where the dependencies can be non-local and time-varying, and we provide a finite-time error bound showing how the convergence rate depends on the speed of information spread in the network. Additionally, as a byproduct of our analysis, we obtain novel finite-time convergence results for a general stochastic approximation scheme and for temporal difference learning with state aggregation, which apply beyond the setting of RL in networked systems.
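To make the last ingredient concrete, below is a minimal sketch of temporal difference learning with state aggregation, one of the building blocks mentioned in the abstract. It is an illustration under simplifying assumptions rather than the paper's algorithm: the aggregation map phi, the function td0_state_aggregation, and the toy random-walk chain are hypothetical names introduced here, and transitions are assumed to come from a fixed policy.

```python
import numpy as np

rng = np.random.default_rng(0)

def td0_state_aggregation(transitions, phi, num_clusters, gamma=0.95, alpha=0.1):
    """TD(0) on aggregated states: one value parameter per cluster.

    transitions -- iterable of (state, reward, next_state) tuples from a fixed policy
    phi         -- maps a raw state to a cluster index in [0, num_clusters)
    """
    theta = np.zeros(num_clusters)
    for s, r, s_next in transitions:
        i, j = phi(s), phi(s_next)
        td_error = r + gamma * theta[j] - theta[i]  # TD(0) error on cluster values
        theta[i] += alpha * td_error
    return theta

# Toy usage: a 100-state random walk aggregated into 10 clusters of 10 states each.
def phi(s):
    return s // 10

def sample_chain(steps=5000, start=50):
    s = start
    for _ in range(steps):
        s_next = min(max(s + rng.choice((-1, 1)), 0), 99)
        reward = 1.0 if s_next == 99 else 0.0
        yield s, reward, s_next
        s = s_next

cluster_values = td0_state_aggregation(sample_chain(), phi, num_clusters=10)
print(cluster_values)  # estimated value of each aggregated state
```

The point of aggregation is that the number of learned parameters equals the number of clusters rather than the number of raw states, which is what keeps value estimation tractable when the global state space grows exponentially with the number of agents.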
