
Multi-Agent Training beyond Zero-Sum with Correlated Equilibrium Meta-Solvers

Two-player, constant-sum games are well studied in the literature, but there has been limited progress outside of this setting. We propose Joint Policy-Space Response Oracles (JPSRO), an algorithm for training agents in n-player, general-sum extensive-form games, which provably converges to an equilibrium. We further suggest correlated equilibria (CE) as promising meta-solvers, and propose a novel solution concept, Maximum Gini Correlated Equilibrium (MGCE), a principled and computationally efficient family of solutions to the correlated equilibrium selection problem. We conduct several experiments using CE meta-solvers for JPSRO and demonstrate convergence on n-player, general-sum games.
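The selection problem mentioned in the abstract can be illustrated on a tiny example. The sketch below is not the paper's implementation: the 2x2 Chicken payoffs, the function names, and the tolerance `eps` are all assumptions chosen for illustration. It checks the correlated-equilibrium incentive constraints for a few joint distributions and scores each with the Gini impurity, 1 minus the sum of squared probabilities, which is assumed here to be the quantity an MGCE-style solver would maximize over the feasible set.

```python
from itertools import product

# Illustrative 2x2 game of Chicken (assumed payoffs, not taken from the paper).
# Action 0 = swerve, action 1 = dare. PAYOFFS[player][a0][a1].
PAYOFFS = [
    [[6, 2], [7, 0]],  # player 0
    [[6, 7], [2, 0]],  # player 1
]


def max_deviation_gain(sigma):
    """Largest probability-weighted gain from swapping a recommended action.

    sigma maps joint actions (a0, a1) to probabilities. A value <= 0 means
    no player benefits from deviating from any recommendation.
    """
    worst = float("-inf")
    for player in (0, 1):
        for rec, dev in product((0, 1), repeat=2):
            if rec == dev:
                continue
            gain = 0.0
            for other in (0, 1):
                joint = (rec, other) if player == 0 else (other, rec)
                swapped = (dev, other) if player == 0 else (other, dev)
                u = PAYOFFS[player]
                gain += sigma[joint] * (
                    u[swapped[0]][swapped[1]] - u[joint[0]][joint[1]]
                )
            worst = max(worst, gain)
    return worst


def is_correlated_equilibrium(sigma, eps=1e-9):
    """True if every incentive constraint holds up to tolerance eps."""
    return max_deviation_gain(sigma) <= eps


def gini_impurity(sigma):
    """Gini impurity of a joint distribution: 1 - sum of squared probabilities."""
    return 1.0 - sum(p * p for p in sigma.values())


# Three candidate joint distributions over (a0, a1).
pure = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 1.0, (1, 1): 0.0}
thirds = {(0, 0): 1 / 3, (0, 1): 1 / 3, (1, 0): 1 / 3, (1, 1): 0.0}
uniform = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}

for name, sigma in [("pure", pure), ("thirds", thirds), ("uniform", uniform)]:
    print(name, is_correlated_equilibrium(sigma), round(gini_impurity(sigma), 3))
```

In this toy game both `pure` and `thirds` satisfy the constraints, so the equilibrium concept alone does not pick one; ranking feasible distributions by Gini impurity prefers the more spread-out `thirds` over `pure`. The `uniform` distribution has the highest impurity but violates the constraints, which is why the maximization has to be restricted to the set of correlated equilibria. Finding the actual maximizer would require a quadratic-program solver, which this sketch does not attempt.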

Thore Graepel | Karl Tuyls | Marc Lanctot | Paul Muller | Luke Marris
