Calibration of Shared Equilibria in General Sum Partially Observable Markov Games

Training multi-agent systems (MAS) to achieve realistic equilibria provides a useful tool for understanding and modeling real-world systems. We consider a general-sum partially observable Markov game in which agents of different types share a single policy network, conditioned on agent-specific information. This paper aims to i) formally understand the equilibria reached by such agents, and ii) match emergent phenomena of such equilibria to real-world targets. Parameter sharing with decentralized execution has been introduced as an efficient way to train multiple agents using a single policy network. However, the nature of the resulting equilibria reached by such agents is not yet understood: we introduce the novel concept of a \textit{Shared equilibrium} as a symmetric pure Nash equilibrium of a certain Functional Form Game (FFG) and prove convergence to the latter for a certain class of games using self-play. In addition, it is important that such equilibria satisfy certain constraints so that MAS are \textit{calibrated} to real-world data for practical use: we solve this problem by introducing a novel dual-Reinforcement Learning approach that fits emergent behaviors of agents in a Shared equilibrium to externally specified targets, and apply our methods to an $n$-player market example. We do so by calibrating parameters governing distributions of agent types rather than individual agents, which allows both behavior differentiation among agents and coherent scaling of the shared policy network to multiple agents.
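To make the parameter-sharing setup concrete, the following is a minimal sketch, not the paper's implementation: it assumes a PyTorch-style policy network, and names such as SharedPolicy, sample_agent_types, obs_dim, and type_dim are illustrative. It shows a single network shared by all agents and conditioned on an agent-type vector, with types drawn from a parameterized distribution whose parameters are what an outer calibration loop would tune against external targets.

```python
# Sketch (assumed structure, not from the paper): one policy network shared
# by all agents, conditioned on agent-specific type features.
import torch
import torch.nn as nn
from torch.distributions import Categorical

class SharedPolicy(nn.Module):
    def __init__(self, obs_dim: int, type_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        # A single set of weights for every agent; the agent's type vector is
        # appended to its observation, so different types can express different
        # behaviors through the same shared network.
        self.net = nn.Sequential(
            nn.Linear(obs_dim + type_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def act(self, obs: torch.Tensor, agent_type: torch.Tensor):
        logits = self.net(torch.cat([obs, agent_type], dim=-1))
        dist = Categorical(logits=logits)
        action = dist.sample()
        return action, dist.log_prob(action)

def sample_agent_types(n_agents: int, mean: torch.Tensor, log_std: torch.Tensor):
    # Agent types come from a parameterized distribution; in the paper's spirit,
    # calibration adjusts these distribution parameters (here a mean / log-std
    # pair, an assumed parameterization) rather than tuning agents individually.
    return mean + log_std.exp() * torch.randn(n_agents, mean.shape[-1])

if __name__ == "__main__":
    policy = SharedPolicy(obs_dim=4, type_dim=2, n_actions=3)
    types = sample_agent_types(n_agents=5, mean=torch.zeros(2), log_std=torch.zeros(2))
    obs = torch.randn(5, 4)                  # one observation per agent
    actions, logps = policy.act(obs, types)  # decentralized execution, shared weights
    print(actions.tolist())
```

Conditioning on sampled type features is what lets a single network scale coherently to many agents while still producing differentiated behavior per type; only the handful of distribution parameters, not per-agent policies, would be exposed to the calibration step.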
