Paper information - Multiagent Reinforcement Learning: Spiking and Nonspiking Agents in the Iterated Prisoner's Dilemma

This paper investigates multiagent reinforcement learning (MARL) in a general-sum game whose payoff structure requires the agents to exploit each other in a way that benefits all agents. The contradictory nature of such games makes them challenging to study in multiagent systems. In particular, we investigate MARL with spiking and nonspiking agents in the Iterated Prisoner's Dilemma, exploring the conditions required to enhance its cooperative outcome. The spiking agents are neural networks with leaky integrate-and-fire neurons trained with one of two learning algorithms: 1) reinforcement of stochastic synaptic transmission, or 2) reward-modulated spike-timing-dependent plasticity with eligibility trace. The nonspiking agents use a tabular representation and are trained with the Q-learning and SARSA algorithms, with a novel reward transformation process also applied to the Q-learning agents. According to the results, the cooperative outcome is enhanced by: 1) transformed internal reinforcement signals and a combination of a high learning rate and a low discount factor with an appropriate exploration schedule in the case of nonspiking agents, and 2) a longer eligibility trace time constant in the case of spiking agents. Moreover, it is shown that spiking and nonspiking agents behave similarly and can therefore be used equally well in a multiagent interaction setting. To train the spiking agents when more than one output neuron competes for reinforcement, a novel and necessary modification that enhances competition is applied to both learning algorithms in order to avoid possible synaptic saturation. This is done by administering additional global reinforcement signals to the networks for every spike of the output neurons that were not “responsible” for the preceding decision.
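As a rough illustration of the nonspiking setup described above, the following sketch pits two tabular Q-learning agents against each other in the Iterated Prisoner's Dilemma, using a high learning rate and a low discount factor. It is not the paper's implementation: the payoff values (the textbook T=5, R=3, P=1, S=0), the state definition (the previous joint action), the linearly decaying exploration schedule, and the names `QAgent` and `play` are all assumptions made for illustration. The paper's reward transformation is not specified in the abstract and is not included.

```python
import random

# Assumed textbook payoffs (T=5, R=3, P=1, S=0); the abstract gives no values.
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}
ACTIONS = ("C", "D")


class QAgent:
    """Tabular Q-learner; state = (own last action, opponent's last action)."""

    def __init__(self, alpha=0.9, gamma=0.1, rng=None):
        self.alpha, self.gamma = alpha, gamma  # high learning rate, low discount
        self.rng = rng or random.Random()
        self.q = {}  # state -> {action: value}

    def values(self, state):
        # Lazily create a zero-initialised row for unseen states.
        return self.q.setdefault(state, {a: 0.0 for a in ACTIONS})

    def act(self, state, epsilon):
        # Epsilon-greedy action selection.
        if self.rng.random() < epsilon:
            return self.rng.choice(ACTIONS)
        vals = self.values(state)
        return max(ACTIONS, key=vals.get)

    def update(self, state, action, reward, next_state):
        # Standard Q-learning update toward reward + gamma * max_a' Q(s', a').
        vals = self.values(state)
        target = reward + self.gamma * max(self.values(next_state).values())
        vals[action] += self.alpha * (target - vals[action])


def play(rounds=2000, seed=0):
    """Run two Q-learners against each other; return the joint-action history."""
    rng = random.Random(seed)
    a, b = QAgent(rng=rng), QAgent(rng=rng)
    state = ("C", "C")  # arbitrary initial state (assumption), from a's viewpoint
    history = []
    for t in range(rounds):
        # Hypothetical schedule: explore fully at first, greedy after half the run.
        epsilon = max(0.0, 1.0 - t / (0.5 * rounds))
        act_a = a.act(state, epsilon)
        act_b = b.act(state[::-1], epsilon)  # b sees the state from its own side
        r_a, r_b = PAYOFF[(act_a, act_b)]
        nxt = (act_a, act_b)
        a.update(state, act_a, r_a, nxt)
        b.update(state[::-1], act_b, r_b, nxt[::-1])
        state = nxt
        history.append(nxt)
    return history
```

Counting how often `("C", "C")` appears in the tail of `play()`'s history gives a crude measure of the cooperative outcome; how that count responds to `alpha`, `gamma`, and the exploration schedule is the kind of question the paper studies, and this sketch makes no claim about the result.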
