Newton-based Policy Optimization for Games

Many learning problems involve multiple agents that optimize distinct, interacting objective functions. In these settings, standard policy gradient algorithms fail because the environment is non-stationary from each agent's perspective and the agents' interests differ. Algorithms must therefore account for the complex dynamics of these systems to guarantee rapid convergence towards a (local) Nash equilibrium. In this paper, we propose NOHD (Newton Optimization on Helmholtz Decomposition), a Newton-like algorithm for multi-agent learning problems based on the decomposition of the system dynamics into their irrotational (potential) and solenoidal (Hamiltonian) components. The method ensures quadratic convergence in purely irrotational and purely solenoidal systems. Furthermore, we show that NOHD is attracted to stable fixed points in general multi-agent systems and repelled by strict saddle points. Finally, we empirically compare NOHD's performance with that of state-of-the-art algorithms on some bimatrix games and in a continuous Gridworld environment.
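
To make the decomposition concrete, the sketch below splits the Jacobian of the simultaneous gradient into its symmetric (irrotational/potential) and antisymmetric (solenoidal/Hamiltonian) parts for a toy two-player quadratic game. The loss functions, helper names, and the final regularized Newton-like step are illustrative assumptions, not the exact NOHD update from the paper.

```python
# Minimal sketch of the Helmholtz-style split of the game dynamics, assuming a toy
# two-player game with losses f1(x, y) = x^T A y and f2(x, y) = x^T B y (hypothetical).
# The final regularized Newton-like step is a generic placeholder, not the NOHD update.
import numpy as np

def simultaneous_gradient(theta, A, B):
    """Stacked gradient xi(theta) = [grad_x f1, grad_y f2] for the toy game."""
    x, y = theta[:A.shape[0]], theta[A.shape[0]:]
    return np.concatenate([A @ y, B.T @ x])

def game_jacobian(xi, theta, eps=1e-5):
    """Central finite-difference Jacobian of the simultaneous gradient."""
    n = theta.size
    J = np.zeros((n, n))
    for i in range(n):
        e = np.zeros(n)
        e[i] = eps
        J[:, i] = (xi(theta + e) - xi(theta - e)) / (2.0 * eps)
    return J

def helmholtz_split(J):
    """Symmetric (potential) and antisymmetric (Hamiltonian) components of J."""
    S = 0.5 * (J + J.T)   # irrotational / potential part
    H = 0.5 * (J - J.T)   # solenoidal / Hamiltonian part
    return S, H

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A, B = rng.standard_normal((3, 2)), rng.standard_normal((3, 2))
    theta = rng.standard_normal(5)
    xi = lambda t: simultaneous_gradient(t, A, B)

    J = game_jacobian(xi, theta)
    S, H = helmholtz_split(J)  # J == S + H up to finite-difference error

    # Placeholder Newton-like step built from the two components (regularized for
    # invertibility); NOHD's actual step on S and H is specified in the paper.
    step = np.linalg.solve(S + H + 1e-3 * np.eye(theta.size), xi(theta))
    theta_next = theta - step
    print("||xi(theta)||      =", np.linalg.norm(xi(theta)))
    print("||xi(theta_next)|| =", np.linalg.norm(xi(theta_next)))
```

In the two limiting cases named in the abstract, one of the components vanishes: in a purely irrotational (potential) game H is zero, and in a purely solenoidal (Hamiltonian) game S is zero, which is the regime in which a Newton-like step on the remaining component can converge quadratically.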
