Multi-Agent Training beyond Zero-Sum with Correlated Equilibrium Meta-Solvers

Two-player, constant-sum games are well studied in the literature, but there has been limited progress outside this setting. We propose Joint Policy-Space Response Oracles (JPSRO), an algorithm for training agents in n-player, general-sum extensive-form games, which provably converges to an equilibrium. We further suggest correlated equilibria (CE) as promising meta-solvers, and propose a novel solution concept, Maximum Gini Correlated Equilibrium (MGCE), a principled and computationally efficient family of solutions to the correlated equilibrium selection problem. We conduct several experiments using CE meta-solvers for JPSRO and demonstrate convergence on n-player, general-sum games.
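To make the two ingredients named above concrete, the following is a minimal sketch, not the paper's implementation. It checks the correlated equilibrium constraints for a joint distribution over a small normal-form game and computes that distribution's Gini impurity, reading "Maximum Gini" as maximizing Gini impurity, 1 minus the sum of squared probabilities, over the set of CEs. The function names `ce_max_gain` and `gini_impurity` and the Chicken-style payoff table are illustrative assumptions, not taken from the source.

```python
import numpy as np

def ce_max_gain(payoffs, sigma):
    """Largest expected gain any player obtains by deviating from a recommendation.

    payoffs: array of shape (n_players, A_1, ..., A_n), one payoff tensor per player.
    sigma:   joint distribution over action profiles, shape (A_1, ..., A_n).
    A value <= 0 (up to tolerance) means sigma satisfies the CE constraints.
    """
    worst = -np.inf
    for p in range(payoffs.shape[0]):
        # Put player p's action on the first axis and flatten the others.
        g = np.moveaxis(payoffs[p], p, 0).reshape(sigma.shape[p], -1)
        s = np.moveaxis(sigma, p, 0).reshape(sigma.shape[p], -1)
        # gains[a, b] = sum_m s[a, m] * (g[b, m] - g[a, m]):
        # gain from playing b when a was recommended.
        gains = s @ g.T - np.sum(s * g, axis=1, keepdims=True)
        worst = max(worst, gains.max())
    return float(worst)

def gini_impurity(sigma):
    """Gini impurity of a joint distribution: 1 - sum of squared probabilities."""
    return float(1.0 - np.sum(sigma ** 2))

# Illustrative two-player Chicken-style game (assumed payoffs).
# Action 0 = yield, action 1 = dare.
payoffs = np.array([
    [[6.0, 2.0], [7.0, 0.0]],   # row player's payoffs
    [[6.0, 7.0], [2.0, 0.0]],   # column player's payoffs
])

# Uniform over the three profiles other than (dare, dare).
sigma_ce = np.array([[1 / 3, 1 / 3], [1 / 3, 0.0]])
# All mass on (yield, yield): each player would rather dare.
sigma_bad = np.array([[1.0, 0.0], [0.0, 0.0]])

is_ce = ce_max_gain(payoffs, sigma_ce) <= 1e-9
is_bad_ce = ce_max_gain(payoffs, sigma_bad) <= 1e-9
```

Here `sigma_ce` passes the check and `sigma_bad` does not. Selecting among the distributions that pass, for example by maximizing `gini_impurity` under the same linear constraints, would be a quadratic program and would need a solver; that step is not shown.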

[1]  Guy Lever,et al.  Human-level performance in 3D multiplayer games with population-based reinforcement learning , 2018, Science.

[2]  Tom Eccles,et al.  Human-Agent Cooperation in Bridge Bidding , 2020, ArXiv.

[3]  Bernhard von Stengel,et al.  Extensive-Form Correlated Equilibrium: Definition and Computational Complexity , 2008, Math. Oper. Res..

[4]  Roy Fox,et al.  Pipeline PSRO: A Scalable Approach for Finding Approximate Nash Equilibria in Large Games , 2020, NeurIPS.

[5]  C. Tsallis Possible generalization of Boltzmann-Gibbs statistics , 1988 .

[6]  G. S. Buttar,et al.  A Brief Review on Different Measures of Entropy , 2019 .

[7]  Christos H. Papadimitriou,et al.  α-Rank: Multi-Agent Evaluation by Evolution , 2019, Scientific Reports.

[8]  Michael H. Bowling,et al.  Solving Common-Payoff Games with Approximate Policy Iteration , 2021, AAAI.

[9]  Sang Joon Kim,et al.  A Mathematical Theory of Communication , 2006 .

[10]  Pierre Baldi,et al.  XDO: A Double Oracle Algorithm for Extensive-Form Games , 2021, ArXiv.

[11]  Paul W. Goldberg,et al.  The complexity of computing a Nash equilibrium , 2006, STOC '06.

[12]  Stephen P. Boyd,et al.  CVXPY: A Python-Embedded Modeling Language for Convex Optimization , 2016, J. Mach. Learn. Res..

[13]  Geoffrey E. Hinton,et al.  Learning representations by back-propagating errors , 1986, Nature.

[14]  Christopher M. Bishop,et al.  Pattern Recognition and Machine Learning (Information Science and Statistics) , 2006 .

[15]  Avrim Blum,et al.  Planning in the Presence of Cost Functions Controlled by an Adversary , 2003, ICML.

[16]  Bernd Gärtner,et al.  Understanding and Using Linear Programming (Universitext) , 2006 .

[17]  Guy Lever,et al.  A Generalized Training Approach for Multiagent Learning , 2020, ICLR.

[18]  A. Wald Contributions to the Theory of Statistical Estimation and Testing Hypotheses , 1939 .

[19]  Tom Eccles,et al.  Learning to Play No-Press Diplomacy with Best Response Policy Iteration , 2020, NeurIPS.

[20]  Nicola Gatti,et al.  Learning to Correlate in Multi-Player General-Sum Sequential Games , 2019, NeurIPS.

[21]  R. Aumann Subjectivity and Correlation in Randomized Strategies , 1974 .

[22]  Jonathan Gray,et al.  Human-Level Performance in No-Press Diplomacy via Equilibrium Search , 2020, ICLR.

[23]  Corinna Cortes,et al.  Support-Vector Networks , 1995, Machine Learning.

[24]  Stephen Boyd,et al.  A Rewriting System for Convex Optimization Problems , 2017, ArXiv.

[25]  Tuomas Sandholm,et al.  Coarse Correlation in Extensive-Form Games , 2019, AAAI.

[26]  Bret Hoehn,et al.  Effective short-term opponent exploitation in simplified poker , 2005, Machine Learning.

[27]  D. O’Leary A generalized conjugate gradient algorithm for solving a class of quadratic programming problems , 1977 .

[28]  Stephen P. Boyd,et al.  OSQP: an operator splitting solver for quadratic programs , 2017, 2018 UKACC 12th International Conference on Control (CONTROL).

[29]  A. Wald Statistical Decision Functions Which Minimize the Maximum Risk , 1945 .

[30]  John C. Harsanyi,et al.  Общая теория выбора равновесия в играх / A General Theory of Equilibrium Selection in Games , 1989 .

[31]  Marc Lanctot,et al.  Further developments of extensive-form replicator dynamics using the sequence-form representation , 2014, AAMAS.

[32]  Tuomas Sandholm,et al.  Correlation in Extensive-Form Games: Saddle-Point Formulation and Benchmarks , 2019, NeurIPS.

[33]  Demis Hassabis,et al.  Mastering the game of Go with deep neural networks and tree search , 2016, Nature.

[34]  David Silver,et al.  Fictitious Self-Play in Extensive-Form Games , 2015, ICML.

[35]  Noam Brown,et al.  Superhuman AI for multiplayer poker , 2019, Science.

[36]  Sriram Srinivasan,et al.  OpenSpiel: A Framework for Reinforcement Learning in Games , 2019, ArXiv.

[37]  Wei-Yin Loh,et al.  Classification and regression trees , 2011, WIREs Data Mining Knowl. Discov..

[38]  Luis E. Ortiz,et al.  Maximum Entropy Correlated Equilibria , 2007, AISTATS.

[39]  D. Avis,et al.  Enumeration of Nash equilibria for two-player games , 2010 .

[40]  J. Vial,et al.  Strategically zero-sum games: The class of games whose completely mixed equilibria cannot be improved upon , 1978 .

[41]  David Silver,et al.  A Unified Game-Theoretic Approach to Multiagent Reinforcement Learning , 2017, NIPS.

[42]  Nicola Gatti,et al.  Simple Uncoupled No-regret Learning Dynamics for Extensive-form Correlated Equilibrium , 2020, J. ACM.

[43]  Wojciech M. Czarnecki,et al.  Grandmaster level in StarCraft II using multi-agent reinforcement learning , 2019, Nature.

[44]  J. Schreiber Foundations Of Statistics , 2016 .

[45]  Paul W. Goldberg,et al.  The Complexity of the Homotopy Method, Equilibrium Selection, and Lemke-Howson Solutions , 2010, 2011 IEEE 52nd Annual Symposium on Foundations of Computer Science.

[46]  Jan Havrda,et al.  Quantification method of classification processes. Concept of structural a-entropy , 1967, Kybernetika.

[47]  Jakub W. Pachocki,et al.  Dota 2 with Large Scale Deep Reinforcement Learning , 2019, ArXiv.

[48]  Miroslav Dudík,et al.  A Sampling-Based Approach to Computing Equilibria in Succinct Extensive-Form Games , 2009, UAI.

[49]  Laurent El Ghaoui,et al.  Robust Optimization , 2021, ICORES.

[50]  Shu-Tao Xia,et al.  Unifying attribute splitting criteria of decision trees by Tsallis entropy , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[51]  Michael Bowling,et al.  Hindsight and Sequential Rationality of Correlated Play , 2021, AAAI.

[52]  Hans-Werner Sinn,et al.  A Rehabilitation of the Principle of Insufficient Reason , 1980 .

[53]  E. Jaynes Information Theory and Statistical Mechanics , 1957 .

[54]  Pierre Hansen,et al.  On the geometry of Nash equilibria and correlated equilibria , 2003, Int. J. Game Theory.

[55]  Jorge Nocedal,et al.  A Limited Memory Algorithm for Bound Constrained Optimization , 1995, SIAM J. Sci. Comput..

[56]  Eric van Damme,et al.  Non-Cooperative Games , 2000 .