Neural Replicator Dynamics

Policy gradient and actor-critic algorithms form the basis of many commonly used training techniques in deep reinforcement learning. In multiagent environments, however, these algorithms face additional challenges such as nonstationarity and instability. In this paper, we first demonstrate that standard softmax-based policy gradient can perform poorly in the presence of even the most benign nonstationarity. By contrast, it is known that the replicator dynamics, a well-studied model from evolutionary game theory, eliminates dominated strategies and exhibits convergence of the time-averaged trajectories to interior Nash equilibria in zero-sum games. Thus, using the replicator dynamics as a foundation, we derive an elegant one-line change to policy gradient methods that simply bypasses the gradient step through the softmax, yielding a new algorithm, Neural Replicator Dynamics (NeuRD). NeuRD reduces to the exponential weights/Hedge algorithm in the single-state all-actions case. Additionally, NeuRD has formal equivalence to softmax counterfactual regret minimization, which guarantees convergence in the sequential tabular case. Importantly, our algorithm provides a straightforward way of extending the replicator dynamics to the function approximation setting. Empirical results show that NeuRD quickly adapts to nonstationarities, significantly outperforming policy gradient in both tabular and function approximation settings, when evaluated on the standard imperfect-information benchmarks of Kuhn Poker, Leduc Poker, and Goofspiel.
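To make the "one-line change" concrete, below is a minimal tabular sketch (single state, all-actions updates, fixed action values), not the paper's implementation: with a softmax policy over logits, the policy gradient update on each logit is the advantage scaled by the action probability pi(a), whereas the NeuRD update applies the advantage to the logit directly, bypassing the softmax Jacobian. All names here (update_logits, q, eta) are illustrative.

```python
import numpy as np

def softmax(y):
    z = y - y.max()
    e = np.exp(z)
    return e / e.sum()

def update_logits(y, q, eta, neurd=True):
    pi = softmax(y)
    adv = q - pi @ q              # advantage of each action under the current policy
    if neurd:
        return y + eta * adv      # NeuRD: apply the advantage to the logits directly
    return y + eta * pi * adv     # softmax policy gradient: advantage scaled by pi(a)

q = np.array([1.0, 0.0, -1.0])    # fixed action values, purely for illustration
y_pg = np.zeros(3)
y_nrd = np.zeros(3)
for _ in range(100):
    y_pg = update_logits(y_pg, q, eta=0.1, neurd=False)
    y_nrd = update_logits(y_nrd, q, eta=0.1, neurd=True)
print("softmax PG policy:", softmax(y_pg))
print("NeuRD policy:     ", softmax(y_nrd))
```

Because the NeuRD step does not vanish as pi(a) becomes small, an action whose value later improves can regain probability mass quickly, which is consistent with the fast adaptation to nonstationarity described above; the policy gradient step, damped by pi(a), recovers much more slowly.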
