Global Convergence of Policy Gradient for Linear-Quadratic Mean-Field Control/Game in Continuous Time

Reinforcement learning is a powerful tool for learning the optimal policies of one or more agents that interact with an environment. As the number of agents grows large, the system can be approximated by a mean-field problem, which has motivated new research directions in mean-field control (MFC) and mean-field games (MFG). In this paper, we study policy gradient methods for linear-quadratic mean-field control and games, where each agent has identical linear state dynamics and a quadratic cost function. While most recent work on policy gradient for MFC and MFG is based on discrete-time models, we focus on continuous-time models, where some of the analysis techniques may be of independent interest to the reader. For both MFC and MFG, we derive a policy gradient update and show that it converges to the optimal solution at a linear rate, which we verify with a synthetic simulation. For MFG, we also provide sufficient conditions for the existence and uniqueness of the Nash equilibrium.
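To make the update concrete, the following is a minimal NumPy/SciPy sketch of an exact policy gradient step for a single-agent continuous-time LQR, the building block of the LQ mean-field setting; the mean-field versions studied in the paper additionally couple each agent's dynamics and cost with the population mean, which this sketch omits. The problem data (A, B, Q, R, Sigma0) and the step size eta are illustrative assumptions, not values from the paper.

    # Minimal sketch: exact policy gradient for a continuous-time LQR,
    # the single-agent building block of the LQ mean-field setting.
    # All problem data and the step size are illustrative assumptions.
    import numpy as np
    from scipy.linalg import solve_continuous_lyapunov

    A = np.array([[0.0, 1.0], [-1.0, -0.5]])   # drift matrix (stable, so K = 0 is feasible)
    B = np.array([[0.0], [1.0]])               # control matrix
    Q = np.eye(2)                              # state cost weight
    R = np.eye(1)                              # control cost weight
    Sigma0 = np.eye(2)                         # initial-state covariance

    def cost_and_grad(K):
        """Cost C(K) = tr(P_K Sigma0) and its exact gradient for u = -K x.

        P_K solves the Lyapunov equation
            (A - B K)^T P + P (A - B K) + Q + K^T R K = 0,
        Sigma_K solves
            (A - B K) S + S (A - B K)^T + Sigma0 = 0,
        and grad C(K) = 2 (R K - B^T P_K) Sigma_K.
        """
        Acl = A - B @ K
        P = solve_continuous_lyapunov(Acl.T, -(Q + K.T @ R @ K))
        S = solve_continuous_lyapunov(Acl, -Sigma0)
        return np.trace(P @ Sigma0), 2.0 * (R @ K - B.T @ P) @ S

    K = np.zeros((1, 2))        # initial stabilizing feedback gain
    eta = 0.05                  # step size, assumed small enough to keep A - BK stable
    for it in range(500):
        c, g = cost_and_grad(K)
        K = K - eta * g         # plain gradient step
    print("final cost:", cost_and_grad(K)[0])

Each iteration costs two Lyapunov solves. Under the gradient-domination (Polyak-Lojasiewicz-type) condition established for this class of problems, such a step contracts the optimality gap geometrically, matching the linear rate claimed above.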
