Efficient Model-Based Multi-Agent Mean-Field Reinforcement Learning

Learning in multi-agent systems is highly challenging due to the inherent complexity introduced by agents’ interactions. We tackle systems with a huge population of interacting agents (e.g., swarms) via Mean-Field Control (MFC). MFC considers an asymptotically infinite population of identical agents that aim to collaboratively maximize the collective reward. Specifically, we consider the case of unknown system dynamics, where the goal is to simultaneously optimize the rewards and learn the dynamics from experience. We propose an efficient model-based reinforcement learning algorithm, M–UCRL, that runs in episodes and provably solves this problem. M–UCRL uses upper-confidence bounds to balance exploration and exploitation during policy learning. Our main theoretical contributions are the first general regret bounds for model-based RL for MFC, obtained via a novel mean-field type analysis. M–UCRL can be instantiated with different models, such as neural networks or Gaussian processes, and effectively combined with neural network policy learning. We empirically demonstrate the convergence of M–UCRL on the swarm motion problem of controlling an infinite population of agents that seek to maximize a location-dependent reward while avoiding congested areas.
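To make the episodic "model learning + upper-confidence planning" loop described above concrete, here is a minimal sketch. It is *not* the paper's algorithm: the population is collapsed to a single mean-field statistic `m` (the agents' mean position), the statistical model is a toy count-based drift estimator standing in for a Gaussian process or neural network, and planning is one-step UCB over a small action grid. All names (`DriftModel`, `optimistic_action`, `true_step`) and constants are illustrative assumptions.

```python
import math

def true_step(m, a):
    # Unknown dynamics the learner must estimate: action plus a drift term
    # (the drift 0.1*sin(3m) is an arbitrary assumption for this sketch).
    return max(-1.0, min(1.0, m + a + 0.1 * math.sin(3.0 * m)))

def reward(m):
    # Location-dependent reward: the population should concentrate at the origin.
    return -m * m

class DriftModel:
    """Nearest-neighbour drift estimate with count-based uncertainty,
    a stand-in for the GP / neural-network models M-UCRL can use."""
    def __init__(self, radius=0.2):
        self.data = []        # observed (state, action, drift) triples
        self.radius = radius

    def predict(self, m):
        near = [d for (x, _, d) in self.data if abs(x - m) < self.radius]
        mean = sum(near) / len(near) if near else 0.0
        sigma = 1.0 / (1.0 + len(near))   # uncertainty shrinks with data
        return mean, sigma

def optimistic_action(model, m, actions, beta=1.0):
    # Upper-confidence planning: predicted one-step value plus an
    # exploration bonus for landing in uncertain states.
    def ucb(a):
        drift, _ = model.predict(m)
        m_next = max(-1.0, min(1.0, m + a + drift))
        _, sigma_next = model.predict(m_next)
        return reward(m_next) + beta * sigma_next
    return max(actions, key=ucb)

actions = [-0.2, -0.1, 0.0, 0.1, 0.2]
model = DriftModel()
returns = []
for episode in range(5):                  # episodic interaction
    m, ep_ret = 0.8, 0.0
    for t in range(20):
        a = optimistic_action(model, m, actions)
        m_next = true_step(m, a)
        model.data.append((m, a, m_next - m - a))   # record observed drift
        ep_ret += reward(m_next)
        m = m_next
    returns.append(ep_ret)                # refit happens implicitly via data
print(returns)
```

The loop mirrors the structure of optimistic model-based RL: act greedily with respect to an upper-confidence value estimate, collect transitions, and let the shrinking model uncertainty drive exploration toward exploitation over episodes.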
