Reinforcement Learning Through Modulation of Spike-Timing-Dependent Synaptic Plasticity

Abstract The persistent modification of synaptic efficacy as a function of the relative timing of pre- and postsynaptic spikes is a phenomenon known as spike-timing-dependent plasticity (STDP). Here we show that the modulation of STDP by a global reward signal leads to reinforcement learning. We first analytically derive learning rules involving reward-modulated spike-timing-dependent synaptic and intrinsic plasticity by applying a reinforcement learning algorithm to the stochastic spike response model of spiking neurons. These rules share several features with plasticity mechanisms found experimentally in the brain. We then demonstrate, in simulations of networks of integrate-and-fire neurons, the efficacy of two simple learning rules involving modulated STDP. One rule is a direct extension of the standard STDP model (modulated STDP). The other involves an eligibility trace stored at each synapse that keeps a decaying memory of the relationships between recent pairs of pre- and postsynaptic spikes (modulated STDP with eligibility trace). This latter rule permits learning even if the reward signal is delayed. The proposed rules are able to solve the XOR problem with both rate-coded and temporally coded input and to learn a target output firing-rate pattern. These learning rules are biologically plausible, may be used for training generic artificial spiking neural networks regardless of the neural model used, and motivate experimental investigation in animals of whether reward-modulated STDP exists.
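The abstract gives no equations, so the following is only a minimal sketch of what a reward-modulated STDP rule with an eligibility trace might look like for a single synapse in discrete time. It assumes exponential pre- and postsynaptic spike traces, an exponentially decaying eligibility trace, and a weight change proportional to reward times trace. The function name `run_modulated_stdp`, all parameter names, and all default values are hypothetical choices for illustration and are not taken from the paper.

```python
import math


def run_modulated_stdp(pre, post, reward, w0=0.5, gamma=0.1,
                       a_plus=1.0, a_minus=-1.0,
                       tau_plus=20.0, tau_minus=20.0, tau_z=25.0, dt=1.0):
    """Return the final weight of one synapse (illustrative sketch only).

    pre, post: sequences of 0/1 spike indicators, one entry per time step.
    reward:    sequence of scalar reward values, one entry per time step.
    All constants are assumed values, not the paper's.
    """
    p_plus = 0.0   # decaying trace of recent presynaptic spikes
    p_minus = 0.0  # decaying (negative) trace of recent postsynaptic spikes
    z = 0.0        # eligibility trace stored at the synapse
    w = w0
    for f_pre, f_post, r in zip(pre, post, reward):
        # Update the spike traces for this time step.
        p_plus = p_plus * math.exp(-dt / tau_plus) + a_plus * f_pre
        p_minus = p_minus * math.exp(-dt / tau_minus) + a_minus * f_post
        # Standard STDP contribution: positive when a postsynaptic spike
        # follows a presynaptic one, negative for the reverse order.
        zeta = p_plus * f_post + p_minus * f_pre
        # The eligibility trace keeps a decaying memory of recent pairings.
        z = z * math.exp(-dt / tau_z) + zeta
        # The weight changes only when the global reward signal is nonzero.
        w += gamma * r * z
    return w


# Usage: a pre-before-post pairing, rewarded 15 steps after the pairing.
steps = 30
pre = [0] * steps
post = [0] * steps
reward = [0.0] * steps
pre[0], post[5], reward[20] = 1, 1, 1.0
w_final = run_modulated_stdp(pre, post, reward)
```

In this sketch the delayed reward still changes the weight because the eligibility trace has not fully decayed when the reward arrives. Dropping the trace and multiplying `zeta` directly by the reward would give a rule in the spirit of the simpler "modulated STDP" variant, which would require the reward to coincide with the spike pairing.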
