Learning to Incentivize Other Learning Agents

The challenge of developing powerful and general Reinforcement Learning (RL) agents has received increasing attention in recent years. Much of this effort has focused on the single-agent setting, in which an agent maximizes a predefined extrinsic reward function. However, a long-term question inevitably arises: how will such independent agents cooperate when they are continually learning and acting in a shared multi-agent environment? Observing that humans often provide incentives to influence others' behavior, we propose to equip each RL agent in a multi-agent environment with the ability to give rewards directly to other agents, using a learned incentive function. Each agent learns its own incentive function by explicitly accounting for its impact on the learning of recipients and, through them, the impact on its own extrinsic objective. We demonstrate in experiments that such agents significantly outperform standard RL and opponent-shaping agents in challenging general-sum Markov games, often by finding a near-optimal division of labor. Our work points toward further opportunities and challenges on the path to ensuring the common good in a multi-agent future.
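
The idea of learning an incentive by accounting for its effect on the recipient's learning can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes a hypothetical one-step game with a single recipient whose policy is a sigmoid over one logit `theta`, a scalar incentive `eta` paid for action 1, exact expected gradients in place of sampled trajectories, and made-up constants `BETA`, `G`, and `COST`. It shows only the core mechanism, differentiating the giver's objective through one learning step of the recipient.

```python
import math

# All constants below are illustrative assumptions, not values from the paper.
BETA = 1.0   # recipient's learning rate
G = 2.0      # giver's extrinsic payoff when the recipient picks action 1
COST = 0.05  # giver's cost per unit of incentive offered


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def recipient_update(theta, eta):
    """One exact policy-gradient step for the recipient.

    The recipient picks action 1 with probability p = sigmoid(theta).
    Action 0 pays extrinsic reward 1; action 1 pays only the incentive eta.
    Expected reward is (1 - p) * 1 + p * eta, whose gradient with respect
    to theta is p * (1 - p) * (eta - 1).
    """
    p = sigmoid(theta)
    return theta + BETA * p * (1.0 - p) * (eta - 1.0)


def giver_objective(theta, eta):
    """Giver's expected extrinsic return after the recipient has updated,
    minus a linear cost for the incentive it offered."""
    p_new = sigmoid(recipient_update(theta, eta))
    return G * p_new - COST * eta


def giver_meta_gradient(theta, eta):
    """Derivative of giver_objective with respect to eta, obtained by the
    chain rule through the recipient's update step."""
    p = sigmoid(theta)
    p_new = sigmoid(recipient_update(theta, eta))
    dtheta_new_deta = BETA * p * (1.0 - p)       # d(theta') / d(eta)
    dp_new_dtheta_new = p_new * (1.0 - p_new)    # d(p') / d(theta')
    return G * dp_new_dtheta_new * dtheta_new_deta - COST
```

Under these assumed constants, `giver_meta_gradient(0.0, 0.0)` is positive: starting from an indifferent recipient and no incentive, the giver's gradient favors offering a reward for action 1, because doing so shifts the recipient's next policy toward the action the giver benefits from. The sign depends on the constants and on how saturated the recipient's policy already is, so the sketch should not be read as predicting any result reported in the paper.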
