Learning While Playing in Mean-Field Games: Convergence and Optimality

We study reinforcement learning in mean-field games. To achieve the Nash equilibrium, which consists of a policy and a mean-field state, existing algorithms require obtaining the optimal policy for any fixed mean-field state. In practice, however, the policy and the mean-field state evolve simultaneously, as each agent is learning while playing. To bridge this gap, we propose a fictitious play algorithm, which alternately updates the policy (learning) and the mean-field state (playing) by one step of policy optimization and gradient descent, respectively. Despite the nonstationarity induced by such an alternating scheme, we prove that the proposed algorithm converges to the Nash equilibrium with an explicit convergence rate. To the best of our knowledge, this is the first provably efficient algorithm that achieves learning while playing.
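To make the alternating scheme concrete, below is a minimal, self-contained sketch in Python. It is an illustrative toy, not the paper's algorithm: a one-shot congestion-style game with n actions, payoff r(a, L) = base[a] - c*L[a], where L is the population's action distribution (the mean-field state). The payoffs, step sizes, and update rules are all assumptions chosen for illustration; the paper's setting is a general mean-field game with an explicit convergence analysis.

```python
import numpy as np

# Toy congestion game (illustrative assumptions, not the paper's setting):
# n actions, payoff of action a given mean field L is base[a] - c * L[a].
rng = np.random.default_rng(0)
n, c = 4, 1.0
base = rng.uniform(0.0, 1.0, size=n)    # intrinsic payoff of each action

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

logits = np.zeros(n)                    # policy parameters
L = np.full(n, 1.0 / n)                 # mean-field state (action distribution)

alpha, beta = 0.5, 0.1                  # learning / playing step sizes
for t in range(2000):
    pi = softmax(logits)
    r = base - c * L                    # payoffs against the current mean field
    # Learning: one policy-gradient step on E_{a ~ pi}[r(a, L)].
    grad = pi * (r - pi @ r)
    logits += alpha * grad
    # Playing: move the mean field one step toward the policy-induced
    # distribution (a fictitious-play-style average).
    L = (1 - beta) * L + beta * pi

print("policy:    ", softmax(logits).round(3))
print("mean field:", L.round(3))
```

In this toy, the two iterates track each other: the policy chases the best response to the current mean field, while the mean field averages toward the distribution the policy induces, which should drive the played actions toward (approximately) equal payoff, the defining property of a Nash equilibrium here.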
