Unifying Behavioral and Response Diversity for Open-ended Learning in Zero-sum Games

Measuring and promoting policy diversity is critical for solving games with strong non-transitive dynamics, where strategic cycles exist and there is no consistent winner (e.g., Rock-Paper-Scissors). Maintaining a pool of diverse policies via open-ended learning is therefore an attractive solution: it can generate an auto-curriculum that keeps the learner from being exploited. However, conventional open-ended learning algorithms lack a widely accepted definition of diversity, which makes diverse policies hard to construct and evaluate. In this work, we summarize previous notions of diversity and work towards a unified diversity measure for multi-agent open-ended learning that covers all elements of a Markov game, based on both Behavioral Diversity (BD) and Response Diversity (RD). At the level of trajectory distributions, we re-define BD in the state-action space as the discrepancy between occupancy measures. For the reward dynamics, we propose RD to characterize diversity through the responses of policies when they encounter different opponents. We also show that many existing diversity measures fall into the category of either BD or RD, but not both. With this unified diversity measure, we design a corresponding diversity-promoting objective and a population-effectivity criterion for computing best responses in open-ended learning. We validate our method on relatively simple settings, namely matrix games and the non-transitive mixture model, as well as on the complex Google Research Football environment. The population found by our method achieves the lowest exploitability and the highest population effectivity on matrix games and the non-transitive mixture model, as well as the largest goal difference against opponents of various levels in Google Research Football.

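To make the two ingredients of the unified measure concrete, below is a minimal, hedged sketch in Python. It is not the paper's implementation: the discrepancy used for BD is illustrated here with total-variation distance between empirical state-action occupancy measures, and RD is approximated by the distance from a candidate policy's payoff (response) vector, evaluated against a fixed opponent set, to the convex hull of the pool's payoff vectors via Monte-Carlo sampling of mixtures. All function names, the divergence choice, and the toy numbers are assumptions for illustration only.

import numpy as np

def behavioral_diversity(rho_new, rho_pool):
    # BD sketch: smallest total-variation distance between the candidate's
    # empirical state-action occupancy measure and those of the pool.
    # (Illustrative discrepancy choice; the paper's divergence may differ.)
    return min(0.5 * np.abs(rho_new - rho).sum() for rho in rho_pool)

def response_diversity(m_new, M_pool, n_samples=10_000, seed=0):
    # RD sketch: distance from the candidate's payoff vector (its responses
    # against a fixed opponent set) to the convex hull spanned by the pool's
    # payoff vectors, estimated crudely by sampling mixture weights.
    rng = np.random.default_rng(seed)
    weights = rng.dirichlet(np.ones(len(M_pool)), size=n_samples)
    mixtures = weights @ np.asarray(M_pool)  # points inside the hull
    return np.linalg.norm(mixtures - m_new, axis=1).min()

# Toy usage (all numbers hypothetical): a pool of 3 policies over a
# 4-dimensional state-action space, payoffs against 2 opponents.
pool_occ = [np.array([.4, .3, .2, .1]),
            np.array([.25, .25, .25, .25]),
            np.array([.1, .2, .3, .4])]
new_occ = np.array([.7, .1, .1, .1])
pool_pay = [np.array([1., -1.]), np.array([0., 0.]), np.array([-1., 1.])]
new_pay = np.array([0.5, 0.5])

print(behavioral_diversity(new_occ, pool_occ))  # high -> behaviorally novel
print(response_diversity(new_pay, pool_pay))    # high -> novel response profile

In a PSRO-style open-ended learning loop, a weighted sum of the two terms could be added to the best-response objective so that each newly trained policy is rewarded both for visiting different state-action regions and for producing a payoff profile the current population cannot already mix to; the exact weighting is a design choice not specified by this sketch.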