Decentralized Graph-Based Multi-Agent Reinforcement Learning Using Reward Machines

In multi-agent reinforcement learning (MARL), it is challenging for a collection of agents to learn complex temporally extended tasks. The difficulties lie in computational complexity and in learning the high-level structure underlying reward functions. We study the graph-based Markov decision process (MDP), in which the dynamics of neighboring agents are coupled. We use a reward machine (RM) to encode each agent's task and to expose the internal structure of the reward function. RMs can describe high-level knowledge and encode non-Markovian reward functions. To tackle the computational complexity, we propose a decentralized learning algorithm, decentralized graph-based reinforcement learning using reward machines (DGRM), which equips each agent with a localized policy so that agents make decisions independently, based on locally available information. DGRM uses an actor-critic structure, and we introduce a tabular Q-function for discrete-state problems. We show that the dependence of an agent's Q-function on other agents decays exponentially as the distance between them increases. Furthermore, the complexity of DGRM is related to the local information size of the largest $\kappa$-hop neighborhood, and DGRM can find an $O(\rho^{\kappa+1})$-approximation of a stationary point of the objective function, where $\rho \in (0,1)$.
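To make the RM concept concrete, below is a minimal sketch under the standard RM formulation: a finite automaton whose transitions are triggered by high-level propositional events and which emits a reward on each transition, so the reward can depend on event history rather than on the current environment state alone. The class name, event labels, and the pick-up/deliver task are illustrative, not taken from the paper.

```python
# Minimal reward machine sketch: a finite automaton over high-level events
# that emits rewards on transitions, encoding a non-Markovian reward.

class RewardMachine:
    def __init__(self, initial_state, transitions, terminal_states):
        # transitions: {(rm_state, event): (next_rm_state, reward)}
        self.state = initial_state
        self.transitions = transitions
        self.terminal_states = terminal_states

    def step(self, event):
        """Advance the RM on a high-level event and return the reward.
        Events with no listed transition leave the RM state unchanged."""
        next_state, reward = self.transitions.get(
            (self.state, event), (self.state, 0.0))
        self.state = next_state
        return reward

    def is_done(self):
        return self.state in self.terminal_states


# Example: an agent must pick up a package ("p") and then deliver it ("d").
# Reward 1.0 is given only for "d" after "p", so the reward depends on
# history, not on the current environment state alone.
rm = RewardMachine(
    initial_state="u0",
    transitions={
        ("u0", "p"): ("u1", 0.0),   # package picked up
        ("u1", "d"): ("u2", 1.0),   # delivered after pickup
    },
    terminal_states={"u2"},
)

for event in ["d", "p", "d"]:       # delivering before pickup earns nothing
    print(event, rm.step(event), rm.state)
```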

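The $\kappa$-hop truncation behind DGRM's complexity bound can likewise be sketched in a few lines. The snippet below, a sketch assuming an undirected interaction graph over agents, computes the set of agents whose states a truncated local Q-function may condition on; the function name and graph encoding are illustrative assumptions, not the paper's code.

```python
# Sketch of kappa-hop truncation: each agent's critic conditions only on
# agents within kappa hops; the exponential-decay result bounds the error.

from collections import deque

def kappa_hop_neighborhood(adj, agent, kappa):
    """Breadth-first search up to depth kappa from `agent`.
    adj: {agent: set of neighbors}. Returns the set of agents whose
    states the truncated Q-function of `agent` depends on."""
    visited = {agent}
    frontier = deque([(agent, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == kappa:
            continue
        for nbr in adj[node]:
            if nbr not in visited:
                visited.add(nbr)
                frontier.append((nbr, depth + 1))
    return visited

# A 5-agent line graph: 0 - 1 - 2 - 3 - 4
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
print(kappa_hop_neighborhood(adj, 2, 1))  # {1, 2, 3}
```

The truncated critic for agent $i$ is then indexed only by the joint (environment state, RM state) of this neighborhood, so its size scales with the largest $\kappa$-hop neighborhood rather than with the total number of agents.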