M^3RL: Mind-aware Multi-agent Management Reinforcement Learning

Most prior work on multi-agent reinforcement learning (MARL) achieves optimal collaboration by directly controlling the agents to maximize a common reward. In this paper, we address the problem from a different angle. In particular, we consider scenarios with self-interested agents (i.e., worker agents) that have their own minds (preferences, intentions, skills, etc.) and cannot be dictated to perform tasks they do not wish to do. To achieve optimal coordination among these agents, we train a super agent (i.e., the manager) to manage them by first inferring their minds from both current and past observations, and then initiating contracts that assign suitable tasks to workers and promise corresponding bonuses so that they agree to work together. The manager's objective is to maximize overall productivity while minimizing the payments made to the workers for ad-hoc worker teaming. To train the manager, we propose Mind-aware Multi-agent Management Reinforcement Learning (M^3RL), which consists of agent modeling and policy learning. We evaluate our approach in two environments, Resource Collection and Crafting, which simulate multi-agent management problems with various task settings and multiple designs for the worker agents. The experimental results validate the effectiveness of our approach in modeling worker agents' minds online and in achieving optimal ad-hoc teaming with good generalization and fast adaptation.
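
To make the contract mechanism described above concrete, the following is a minimal Python sketch and not the method proposed in the paper. It assumes a contract is a (task, bonus) pair, that a worker accepts when its intrinsic preference for the task plus the bonus is positive, that accepted tasks are always completed, and that the manager's reward is the value of completed tasks minus the bonuses paid. The names `Contract`, `worker_accepts`, `manager_reward`, `preferences`, and `task_value` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Contract:
    goal: int     # index of the task assigned to the worker
    bonus: float  # payment promised if the worker completes the task


def worker_accepts(preference, contract):
    # Assumed acceptance rule: intrinsic preference plus bonus must be positive.
    return preference[contract.goal] + contract.bonus > 0


def manager_reward(contracts, preferences, task_value):
    # Assumed objective: value of accepted (and completed) tasks minus bonuses paid.
    total = 0.0
    for contract, preference in zip(contracts, preferences):
        if worker_accepts(preference, contract):
            total += task_value[contract.goal] - contract.bonus
    return total


# Two workers, two tasks. Each row holds one worker's hidden task preferences.
preferences = [[-1.0, 0.5], [0.2, -2.0]]
task_value = [5.0, 3.0]
contracts = [Contract(goal=0, bonus=2.0), Contract(goal=1, bonus=1.0)]
reward = manager_reward(contracts, preferences, task_value)
```

In this toy setting the first worker accepts and the second declines, because the offered bonus does not outweigh its dislike of the assigned task. In the setting the abstract describes, the manager cannot read `preferences` directly and must infer them from the workers' observed behavior, which is the role of the agent-modeling component.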
