Learning Expensive Coordination: An Event-Based Deep RL Approach

Existing work in deep Multi-Agent Reinforcement Learning (MARL) mainly focuses on coordinating cooperative agents to complete tasks jointly. However, in many real-world settings agents are self-interested, such as employees in a company or clubs in a league. The leader, i.e., the manager of the company or the league, therefore needs to pay bonuses to followers to coordinate them efficiently, which we call expensive coordination. Expensive coordination poses two main difficulties: i) when assigning bonuses, the leader has to consider their long-term effect and predict the followers' behaviors, and ii) the complex interactions between followers make training hard to converge, especially when the leader's policy changes over time. In this work, we address this problem with an event-based deep RL approach. Our main contributions are threefold. (1) We model the leader's decision-making process as a semi-Markov Decision Process and propose a novel multi-agent event-based policy gradient to learn the leader's long-term policy. (2) We exploit the leader-follower consistency scheme to design a follower-aware module and a follower-specific attention module that predict the followers' behaviors and respond to them accurately. (3) We propose an action abstraction-based policy gradient algorithm that reduces the followers' decision space and thus accelerates their training. Experiments in resource collection, navigation, and the predator-prey game show that our approach substantially outperforms state-of-the-art methods.
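To make the semi-Markov view in contribution (1) concrete, the following minimal sketch computes the discounted return a leader would see if it acted only at selected "event" steps, with rewards accumulating between events. It illustrates the generic semi-MDP return only, not the event-based policy gradient itself, which the abstract does not specify. The function name `smdp_returns` and its arguments are hypothetical.

```python
from typing import List, Sequence


def smdp_returns(rewards: Sequence[float],
                 event_steps: Sequence[int],
                 gamma: float) -> List[float]:
    """Return the discounted return at each event step (hypothetical helper).

    rewards[t] is the leader's reward at primitive step t.
    event_steps are the sorted step indices at which the leader acts.
    """
    horizon = len(rewards)
    # Each event k covers the steps from event_steps[k] up to the next event.
    bounds = list(event_steps) + [horizon]
    returns = [0.0] * len(event_steps)
    next_return = 0.0
    # Backward pass over events: G_k = R_k + gamma^tau_k * G_{k+1}.
    for k in range(len(event_steps) - 1, -1, -1):
        start, end = bounds[k], bounds[k + 1]
        duration = end - start  # tau_k, the time until the next event
        # R_k: reward accumulated (and discounted) within the interval.
        segment = sum(gamma ** i * rewards[start + i] for i in range(duration))
        next_return = segment + gamma ** duration * next_return
        returns[k] = next_return
    return returns


if __name__ == "__main__":
    print(smdp_returns([1, 0, 2, 0, 4], event_steps=[0, 2], gamma=0.5))
```

The discount between consecutive decisions is `gamma ** duration` rather than a single `gamma`, which is what distinguishes a semi-MDP return from the ordinary per-step one.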
