Learning Nash Equilibria in Zero-Sum Stochastic Games via Entropy-Regularized Policy Approximation

We explore the use of policy approximation to reduce the computational cost of learning Nash equilibria in multi-agent reinforcement learning. We propose a new algorithm for zero-sum stochastic games in which each agent simultaneously learns a Nash policy and an entropy-regularized policy. The two policies help each other converge: the former guides the latter toward the desired Nash equilibrium, while the latter serves as an efficient approximation of the former. We demonstrate that the proposed algorithm can transfer previous training experience to different environments, enabling the agents to adapt quickly to new tasks. We also provide a dynamic hyper-parameter scheduling scheme that further expedites convergence. Empirical results on a number of stochastic games show that the proposed algorithm converges to the Nash equilibrium while exhibiting a major speed-up over existing algorithms.
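To give a feel for why an entropy-regularized policy can be a cheap stand-in for a Nash policy, here is a minimal sketch on a single zero-sum matrix game. It is not the proposed algorithm. It assumes the common softmax form of an entropy-regularized response, in which the probability of action a is proportional to prior(a) * exp(beta * q(a)). The function name `soft_response`, the uniform prior, and the values of `beta` are illustrative assumptions.

```python
import math

def soft_response(payoff, opponent, beta, prior=None):
    """Entropy-regularized response: pi(a) proportional to prior(a) * exp(beta * q(a)).

    payoff[a][b] is the row player's payoff; opponent is the column player's
    mixed strategy. Hypothetical helper for illustration only.
    """
    n = len(payoff)
    prior = prior or [1.0 / n] * n
    # Expected payoff of each row action against the opponent's mixed strategy.
    q = [sum(p * s for p, s in zip(row, opponent)) for row in payoff]
    # Subtract the max before exponentiating for numerical stability.
    m = max(q)
    weights = [r * math.exp(beta * (x - m)) for r, x in zip(prior, q)]
    total = sum(weights)
    return [w / total for w in weights]

# Matching pennies: the Nash equilibrium is uniform play for both players.
matching_pennies = [[1.0, -1.0], [-1.0, 1.0]]

# Against the equilibrium opponent, both actions have equal expected payoff,
# so the soft response stays at the uniform prior.
at_equilibrium = soft_response(matching_pennies, [0.5, 0.5], beta=5.0)

# Against a biased opponent, a large beta concentrates on the best action.
against_biased = soft_response(matching_pennies, [0.9, 0.1], beta=50.0)
```

The response is a closed-form expression, so it needs no linear program or other equilibrium solver. In this sketch `beta` controls how closely the soft policy follows the greedy choice, which is the kind of quantity a scheduling scheme could adjust during training.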
