Learning with Opponent-Learning Awareness

Multi-agent settings are quickly gathering importance in machine learning. This includes a plethora of recent work on deep multi-agent reinforcement learning, but also extends to hierarchical reinforcement learning, generative adversarial networks and decentralised optimisation. In all these settings the presence of multiple learning agents renders the training problem non-stationary and often leads to unstable training or undesired final results. We present Learning with Opponent-Learning Awareness (LOLA), a method in which each agent shapes the anticipated learning of the other agents in the environment. The LOLA learning rule includes an additional term that accounts for the impact of one agent's policy on the anticipated parameter update of the other agents. Results show that the encounter of two LOLA agents leads to the emergence of tit-for-tat and therefore cooperation in the iterated prisoners' dilemma (IPD), while independent learning does not. In this domain, LOLA also receives higher payouts compared to a naive learner, and is robust against exploitation by higher-order gradient-based methods. Applied to repeated matching pennies, LOLA agents converge to the Nash equilibrium. In a round-robin tournament, we show that LOLA agents successfully shape the learning of a range of multi-agent learning algorithms from the literature, resulting in the highest average returns on the IPD. We also show that the LOLA update rule can be efficiently calculated using an extension of the policy gradient estimator, making the method suitable for model-free RL. The method thus scales to large parameter and input spaces and nonlinear function approximators. We apply LOLA to a grid-world task with an embedded social dilemma using recurrent policies and opponent modelling. By explicitly considering the learning of the other agent, LOLA agents learn to cooperate out of self-interest. The code is at this http URL.
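To make the shaping term concrete, the first-order, exact-gradient form of the update for agent 1 can be sketched as follows; the notation is assumed for illustration (not taken verbatim from the abstract): $\theta^1, \theta^2$ are the two agents' policy parameters, $V^1, V^2$ their expected returns, and $\delta, \eta$ the respective learning rates.

\[
\theta^1 \leftarrow \theta^1
\;+\; \delta \, \nabla_{\theta^1} V^1(\theta^1, \theta^2)
\;+\; \delta \eta \, \big( \nabla_{\theta^2} V^1(\theta^1, \theta^2) \big)^{\top} \, \nabla_{\theta^1} \nabla_{\theta^2} V^2(\theta^1, \theta^2)
\]

The first gradient term is the naive learner's step; the second term differentiates through the opponent's anticipated naive update $\eta \, \nabla_{\theta^2} V^2$, which is what lets a LOLA agent shape the opponent's learning rather than treat it as fixed.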
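As a further illustration of how such an update can be computed with automatic differentiation, a minimal sketch of the exact-gradient step for agent 1 is given below. This is not the released implementation: the value functions V1, V2, the step sizes delta and eta, and the function name are all illustrative assumptions, and the model-free variant described in the abstract would replace the exact gradients with policy-gradient estimates.

```python
# Minimal sketch (not the authors' code) of an exact-gradient LOLA step for
# agent 1, assuming V1 and V2 are differentiable JAX functions mapping
# (th1, th2) to a scalar expected return.
import jax
import jax.numpy as jnp

def lola_step_agent1(th1, th2, V1, V2, delta=0.1, eta=0.1):
    # Naive learning term: gradient of agent 1's return w.r.t. its own parameters.
    naive = jax.grad(V1, argnums=0)(th1, th2)
    # How agent 1's return changes with agent 2's parameters.
    dV1_dth2 = jax.grad(V1, argnums=1)(th1, th2)
    # Shaping term: (grad_{th2} V1)^T grad_{th1} grad_{th2} V2, obtained by
    # differentiating a scalar inner product w.r.t. th1.
    shaping = jax.grad(
        lambda a: jnp.dot(dV1_dth2, jax.grad(V2, argnums=1)(a, th2))
    )(th1)
    return th1 + delta * naive + delta * eta * shaping
```

The symmetric step for agent 2 swaps the roles of the two parameter vectors.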
