LIGS: Learnable Intrinsic-Reward Generation Selection for Multi-Agent Learning

Efficient exploration is important for reinforcement learners to achieve high rewards. In multi-agent systems, coordinated exploration and behaviour are critical for agents to jointly achieve optimal outcomes. In this paper, we introduce a new general framework for improving the coordination and performance of multi-agent reinforcement learning (MARL) agents. Our framework, the Learnable Intrinsic-reward Generation Selection algorithm (LIGS), introduces an adaptive learner, the Generator, that observes the agents and learns to construct intrinsic rewards online which coordinate the agents' joint exploration and joint behaviour. Using a novel combination of MARL and switching controls, LIGS determines the best states at which to add intrinsic rewards, which leads to a highly efficient learning process. LIGS can subdivide complex tasks, making them easier to solve, and enables systems of MARL agents to quickly solve environments with sparse rewards. LIGS can be seamlessly combined with existing MARL algorithms, and our theory shows that it ensures convergence to policies that deliver higher system performance. We demonstrate its superior performance in challenging tasks in Foraging and StarCraft II.
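The abstract leaves the parameterisation of the Generator and of the switching control unspecified. The sketch below is a minimal illustration only, assuming a gated additive shaping of the form r + β·g(s)·r_int(s): the names Generator, SwitchingControl, shaped_rewards, and the coefficient scale are hypothetical, not the paper's actual API.

```python
# Illustrative sketch (not the paper's implementation): a learned Generator
# proposes per-agent intrinsic rewards, and a switching control decides,
# state by state, whether those rewards are added to the extrinsic reward.
import torch
import torch.nn as nn


class Generator(nn.Module):
    """Maps a joint observation to one intrinsic reward per agent
    (hypothetical architecture)."""
    def __init__(self, obs_dim: int, n_agents: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_agents),
        )

    def forward(self, joint_obs: torch.Tensor) -> torch.Tensor:
        return self.net(joint_obs)


class SwitchingControl(nn.Module):
    """Binary switch g(s) in {0, 1}: at which states intrinsic rewards are
    switched on (a stand-in for the paper's switching controls)."""
    def __init__(self, obs_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, joint_obs: torch.Tensor) -> torch.Tensor:
        # Hard 0/1 decision at execution time; training this switch by
        # gradient descent would require a continuous relaxation.
        return (torch.sigmoid(self.net(joint_obs)) > 0.5).float()


def shaped_rewards(ext_rewards: torch.Tensor,
                   joint_obs: torch.Tensor,
                   generator: Generator,
                   switch: SwitchingControl,
                   scale: float = 0.1) -> torch.Tensor:
    """Extrinsic rewards plus gated intrinsic rewards, one per agent."""
    intrinsic = generator(joint_obs)   # shape: (n_agents,)
    gate = switch(joint_obs)           # shape: (1,), value in {0, 1}
    return ext_rewards + scale * gate * intrinsic


# Usage example with made-up dimensions:
gen = Generator(obs_dim=8, n_agents=3)
sw = SwitchingControl(obs_dim=8)
r = shaped_rewards(torch.zeros(3), torch.randn(8), gen, sw)
```

The gate confines shaping to the states where it is active, mirroring the abstract's idea of selecting the best states at which to add intrinsic rewards rather than shaping everywhere.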
