A New Formalism, Method and Open Issues for Zero-Shot Coordination

In many coordination problems, independently reasoning humans are able to discover mutually compatible policies. In contrast, independently trained self-play policies are often mutually incompatible. Zero-shot coordination (ZSC) has recently been proposed as a new frontier in multi-agent reinforcement learning to address this fundamental issue. Prior work approaches the ZSC problem by assuming that players can agree on a shared learning algorithm but not on labels for actions and observations, and proposes other-play as an optimal solution. However, until now, this “label-free” problem has only been informally defined. We formalize this setting as the label-free coordination (LFC) problem by defining the label-free coordination game. We show that other-play is not an optimal solution to the LFC problem, as it fails to consistently break ties between incompatible maximizers of the other-play objective. We introduce an extension of the algorithm, other-play with tie-breaking, and prove that it is optimal in the LFC problem and an equilibrium of the LFC game. Since arbitrary tie-breaking is precisely what the ZSC setting aims to prevent, we conclude that the LFC problem does not reflect the aims of ZSC. As a starting point for future work, we introduce an alternative, informal operationalization of ZSC.
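
For context, a minimal sketch of the other-play objective referenced above, following the formulation of Hu et al. (2020); the symbols here ($\Phi$, $J$, $\pi_1$, $\pi_2$) are illustrative notation and not taken from this abstract:

$$\pi^{*} \in \arg\max_{\pi}\; \mathbb{E}_{\phi \sim \mathrm{Unif}(\Phi)}\!\left[\, J\big(\pi_{1},\, \phi(\pi_{2})\big) \right],$$

where $\Phi$ is the symmetry group of the underlying Dec-POMDP, $J$ is the expected return, and $\phi(\pi_{2})$ denotes the policy obtained by relabeling the actions and observations of $\pi_{2}$ under the symmetry $\phi$. When this objective admits several mutually incompatible maximizers, independently trained agents may each select a different one; other-play with tie-breaking augments the objective with a consistent rule for selecting among these maximizers.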
