Learning to Incentivize Other Learning Agents

The challenge of developing powerful and general Reinforcement Learning (RL) agents has received increasing attention in recent years. Much of this effort has focused on the single-agent setting, in which an agent maximizes a predefined extrinsic reward function. However, a long-term question inevitably arises: how will such independent agents cooperate when they are continually learning and acting in a shared multi-agent environment? Observing that humans often provide incentives to influence others' behavior, we propose to equip each RL agent in a multi-agent environment with the ability to give rewards directly to other agents, using a learned incentive function. Each agent learns its own incentive function by explicitly accounting for its impact on the learning of recipients and, through them, the impact on its own extrinsic objective. We demonstrate in experiments that such agents significantly outperform standard RL and opponent-shaping agents in challenging general-sum Markov games, often by finding a near-optimal division of labor. Our work points toward more opportunities and challenges along the path to ensuring the common good in a multi-agent future.
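The core mechanism described above is a bilevel update: the incentive-giving agent differentiates its own extrinsic return *through* the recipient's learning step. The following is a minimal sketch of that idea in plain Python, on a hypothetical two-action bandit where the recipient's own rewards favor one action and the giver's rewards favor the other. All rewards, step sizes, the scalar incentive parameter `eta`, and the exact gradients are illustrative assumptions for this toy setting, not the paper's actual parameterization:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical setup: the recipient's own rewards favor action 0,
# while the giver's extrinsic rewards favor action 1.
R_J = (1.0, 0.0)   # recipient's extrinsic reward for actions (0, 1)
R_I = (0.0, 2.0)   # giver's extrinsic reward for actions (0, 1)
ALPHA = 0.5        # recipient's policy-gradient step size
COST = 0.01        # giver's cost coefficient on incentives paid

def recipient_update(theta, eta):
    """One exact policy-gradient step on the recipient's total reward:
    its own reward plus an incentive of eta paid when it picks action 1."""
    p1 = sigmoid(theta)                         # probability of action 1
    dp1 = p1 * (1.0 - p1)                       # d p1 / d theta
    grad = dp1 * ((R_J[1] + eta) - R_J[0])      # d E[total reward] / d theta
    return theta + ALPHA * grad

def giver_objective_and_grad(theta, eta):
    """Giver's extrinsic return under the recipient's *updated* policy,
    minus an incentive cost, and its gradient through that update."""
    p1 = sigmoid(theta)
    dp1 = p1 * (1.0 - p1)
    theta_new = recipient_update(theta, eta)
    q1 = sigmoid(theta_new)
    dq1 = q1 * (1.0 - q1)
    J = q1 * R_I[1] + (1.0 - q1) * R_I[0] - COST * eta * eta
    dtheta_new_deta = ALPHA * dp1               # chain rule through the update
    dJ_deta = dq1 * (R_I[1] - R_I[0]) * dtheta_new_deta - 2.0 * COST * eta
    return J, dJ_deta

# The giver ascends its own objective with respect to the incentive eta;
# it learns to pay a positive incentive that redirects the recipient.
theta, eta = 0.0, 0.0
for _ in range(200):
    _, g = giver_objective_and_grad(theta, eta)
    eta += 0.5 * g
```

Even in this toy case, the giver learns a strictly positive incentive because paying it steers the recipient's post-update policy toward the action the giver prefers, and the gain in the giver's extrinsic return outweighs the incentive cost.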
