A Spiking Neural Network Model of an Actor-Critic Learning Agent

The ability to adapt behavior in order to maximize reward through interactions with the environment is crucial for the survival of any higher organism. In the framework of reinforcement learning, temporal-difference learning algorithms provide an effective strategy for such goal-directed adaptation, but it is unclear to what extent these algorithms are compatible with neural computation. In this article, we present a spiking neural network model that implements actor-critic temporal-difference learning by combining local plasticity rules with a global reward signal. The network is capable of solving a nontrivial gridworld task with sparse rewards. We derive a quantitative mapping of plasticity parameters and synaptic weights to the corresponding variables in the standard algorithmic formulation, and we demonstrate that the network learns as quickly as its discrete-time counterpart and attains the same equilibrium performance.
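For orientation, the discrete-time actor-critic algorithm to which the spiking network is mapped can be sketched in a few lines. The Python sketch below is a minimal tabular actor-critic with TD(0) updates on a gridworld; the grid size, goal position, reward value, discount factor, and learning rates are illustrative assumptions and do not reproduce the paper's exact task or parameters.

    import numpy as np

    # Minimal tabular actor-critic with TD(0) on a gridworld.
    # Grid size, goal state, reward, and learning rates below are
    # illustrative assumptions, not the paper's exact setup.
    rng = np.random.default_rng(0)

    SIZE = 5                                       # 5x5 grid (assumed)
    GOAL = (4, 4)                                  # rewarded terminal state (assumed)
    ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

    gamma = 0.9      # discount factor (assumed)
    alpha_v = 0.1    # critic learning rate (assumed)
    alpha_p = 0.1    # actor learning rate (assumed)

    V = np.zeros((SIZE, SIZE))          # critic: state-value estimates
    prefs = np.zeros((SIZE, SIZE, 4))   # actor: action preferences

    def step(state, a):
        """Apply action a; moves into walls leave the state unchanged."""
        r, c = state
        dr, dc = ACTIONS[a]
        return (min(max(r + dr, 0), SIZE - 1),
                min(max(c + dc, 0), SIZE - 1))

    def policy(state):
        """Sample an action from a softmax over the actor's preferences."""
        p = np.exp(prefs[state] - prefs[state].max())
        p /= p.sum()
        return rng.choice(4, p=p)

    for episode in range(500):
        s = (0, 0)
        while s != GOAL:
            a = policy(s)
            s_next = step(s, a)
            reward = 1.0 if s_next == GOAL else 0.0     # sparse reward
            v_next = 0.0 if s_next == GOAL else V[s_next]
            delta = reward + gamma * v_next - V[s]      # TD error
            V[s] += alpha_v * delta                     # critic update
            prefs[s][a] += alpha_p * delta              # actor update
            s = s_next

On each step the TD error delta acts as a single broadcast quantity that simultaneously drives the critic's value update and the actor's preference update, mirroring the abstract's combination of local plasticity rules with one global reward-related signal.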

[1]  Ian H. Witten,et al.  An Adaptive Optimal Controller for Discrete-Time Markov Environments , 1977, Inf. Control..

[2]  A P Georgopoulos,et al.  On the relations between the direction of two-dimensional arm movements and cell discharge in primate motor cortex , 1982, The Journal of neuroscience : the official journal of the Society for Neuroscience.

[3]  Richard S. Sutton,et al.  Neuronlike adaptive elements that can solve difficult learning control problems , 1983, IEEE Transactions on Systems, Man, and Cybernetics.

[4]  A. Harry Klopf,et al.  A drive-reinforcement model of single neuron function , 1987 .

[5]  B. Kosco Differential Hebbian learning , 1987 .

[6]  A. Klopf A neuronal model of classical conditioning , 1988 .

[7]  Daniel J. Amit,et al.  Modeling brain function: the world of attractor neural networks, 1st Edition , 1989 .

[8]  A. Aertsen,et al.  Synaptic plasticity in rat hippocampal slice cultures: local "Hebbian" conjunction of pre- and postsynaptic stimulation leads to distributed synaptic enhancement. , 1989, Proceedings of the National Academy of Sciences of the United States of America.

[9]  J. Bolz,et al.  Non-Hebbian synapses in rat visual cortex. , 1990, Neuroreport.

[10]  P. Dayan The Convergence of TD(λ) for General λ , 2004, Machine Learning.

[11]  R. J. Williams,et al.  Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning , 2004, Machine Learning.

[12]  Joel L. Davis,et al.  A Model of How the Basal Ganglia Generate and Use Neural Signals That Predict Reinforcement , 1994 .

[13]  Gerald Tesauro,et al.  TD-Gammon, a Self-Teaching Backgammon Program, Achieves Master-Level Play , 1994, Neural Computation.

[14]  D. Madison,et al.  Locally distributed synaptic potentiation in the hippocampus. , 1994, Science.

[15]  P. Dayan,et al.  TD(λ) converges with probability 1 , 2004, Machine Learning.

[16]  A. Barto,et al.  Adaptive Critics and the Basal Ganglia , 1994 .

[17]  Joel L. Davis,et al.  Adaptive Critics and the Basal Ganglia , 1995 .

[18]  Peter Dayan,et al.  Bee foraging in uncertain environments using predictive hebbian learning , 1995, Nature.

[19]  P. Dayan,et al.  A framework for mesencephalic dopamine systems based on predictive Hebbian learning , 1996, The Journal of neuroscience : the official journal of the Society for Neuroscience.

[20]  M. Poo,et al.  Spread of Synaptic Depression Mediated by Presynaptic Cytoplasmic Signaling , 1996, Science.

[21]  John N. Tsitsiklis,et al.  Neuro-Dynamic Programming , 1996, Encyclopedia of Machine Learning.

[22]  Peter Dayan,et al.  A Neural Substrate of Prediction and Reward , 1997, Science.

[23]  M. Poo,et al.  Propagation of activity-dependent synaptic depression in simple neural networks , 1997, Nature.

[24]  D. Johnston,et al.  Regulation of Synaptic Efficacy by Coincidence of Postsynaptic APs and EPSPs , 1997 .

[25]  Y. Frégnac,et al.  A phenomenological model of visually evoked spike trains in cat geniculate nonlagged X-cells , 1998, Visual Neuroscience.

[26]  John S. Denker,et al.  Neural Networks for Computing , 1998 .

[27]  Richard S. Sutton,et al.  Introduction to Reinforcement Learning , 1998 .

[28]  Li I. Zhang,et al.  A critical window for cooperation and competition among developing retinotectal synapses , 1998, Nature.

[29]  G. Bi,et al.  Synaptic Modifications in Cultured Hippocampal Neurons: Dependence on Spike Timing, Synaptic Strength, and Postsynaptic Cell Type , 1998, The Journal of Neuroscience.

[30]  W. Schultz,et al.  A neural network model with dopamine-like reinforcement signal that learns a spatial delayed response task , 1999, Neuroscience.

[31]  Kenji Doya,et al.  Reinforcement Learning in Continuous Time and Space , 2000, Neural Computation.

[32]  David J. Foster,et al.  A model of hippocampally dependent navigation, using the temporal difference learning rule , 2000, Hippocampus.

[33]  Li I. Zhang,et al.  Selective Presynaptic Propagation of Long-Term Potentiation in Defined Neural Networks , 2000, The Journal of Neuroscience.

[34]  K. Doya Complementary roles of basal ganglia and cerebellum in learning and motor control , 2000, Current Opinion in Neurobiology.

[35]  R. Kempter,et al.  Temporal map formation in the barn owl's brain. , 2001, Physical review letters.

[36]  Rajesh P. N. Rao,et al.  Spike-Timing-Dependent Hebbian Plasticity as Temporal Difference Learning , 2001, Neural Computation.

[37]  R. Kempter,et al.  Formation of temporal-feature maps by axonal propagation of synaptic learning , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[38]  Jun Morimoto,et al.  Acquisition of stand-up behavior by a real robot using hierarchical reinforcement learning , 2000, Robotics Auton. Syst..

[39]  Roland E. Suri,et al.  Temporal Difference Model Reproduces Anticipatory Neural Activity , 2001, Neural Computation.

[40]  W. Schultz Getting Formal with Dopamine and Reward , 2002, Neuron.

[41]  Y. Niv,et al.  Evolution of Reinforcement Learning in Uncertain Environments: A Simple Explanation for Complex Foraging Behaviors , 2002 .

[42]  Kenji Doya,et al.  Metalearning and neuromodulation , 2002, Neural Networks.

[43]  Eytan Ruppin,et al.  Actor-critic models of the basal ganglia: new anatomical and computational perspectives , 2002, Neural Networks.

[44]  J. Leo van Hemmen,et al.  Mapping time , 2002, Biological Cybernetics.

[45]  John N. J. Reynolds,et al.  Dopamine-dependent plasticity of corticostriatal synapses , 2002, Neural Networks.

[46]  Y. Dan,et al.  Spike-timing-dependent synaptic modification induced by natural spike trains , 2002, Nature.

[47]  Florentin Wörgötter,et al.  Isotropic Sequence Order Learning , 2003, Neural Computation.

[48]  Karl J. Friston,et al.  Temporal Difference Models and Reward-Related Learning in the Human Brain , 2003, Neuron.

[49]  H. Seung,et al.  Learning in Spiking Neural Networks by Reinforcement of Stochastic Synaptic Transmission , 2003, Neuron.

[50]  Vijay R. Konda,et al.  OnActor-Critic Algorithms , 2003, SIAM J. Control. Optim..

[51]  E. Kandel,et al.  Activity-Dependent Presynaptic Facilitation and Hebbian LTP Are Both Required and Interact during Classical Conditioning in Aplysia , 2003, Neuron.

[52]  Xiaohui Xie,et al.  Learning in neural networks by reinforcement of irregular spiking. , 2004, Physical review. E, Statistical, nonlinear, and soft matter physics.

[53]  Karl J. Friston,et al.  Dissociable Roles of Ventral and Dorsal Striatum in Instrumental Conditioning , 2004, Science.

[54]  Patrick D. Roberts,et al.  Computational Consequences of Temporally Asymmetric Learning Rules: I. Differential Hebbian Learning , 1999, Journal of Computational Neuroscience.

[55]  Patrick D. Roberts,et al.  Computational Consequences of Temporally Asymmetric Learning Rules: II. Sensory Image Cancellation , 2000, Journal of Computational Neuroscience.

[56]  M. Delgado,et al.  Modulation of Caudate Activity by Action Contingency , 2004, Neuron.

[57]  Peter Dayan,et al.  Temporal difference models describe higher-order learning in humans , 2004, Nature.

[58]  Daniel Lehmann,et al.  Modeling Compositionality by Dynamic Binding of Synfire Chains , 2004, Journal of Computational Neuroscience.

[59]  S. Thorpe,et al.  Spike times make sense , 2005, Trends in Neurosciences.

[60]  Florentin Wörgötter,et al.  Temporal Sequence Learning, Prediction, and Control: A Review of Different Models and Their Relation to Biological Mechanisms , 2005, Neural Computation.

[61]  Richard S. Sutton,et al.  Reinforcement Learning: An Introduction , 1998, IEEE Trans. Neural Networks.

[62]  Richard S. Sutton,et al.  Learning to predict by the methods of temporal differences , 1988, Machine Learning.

[63]  Rémi Munos,et al.  Policy Gradient in Continuous Time , 2006, J. Mach. Learn. Res..

[64]  W. Gerstner,et al.  Triplets of Spikes in a Model of Spike Timing-Dependent Plasticity , 2006, The Journal of Neuroscience.

[65]  R. Dolan,et al.  Dopamine-dependent prediction errors underpin reward-seeking behaviour in humans , 2006, Nature.

[66]  Markus Diesmann,et al.  Programmable Logic Construction Kits for Hyper-Real-Time Neuronal Modeling , 2006, Neural Computation.

[67]  E. Vaadia,et al.  Midbrain dopamine neurons encode decisions for future action , 2006, Nature Neuroscience.

[68]  Stefan Philipp,et al.  Interconnecting VLSI Spiking Neural Networks Using Isochronous Connections , 2007, IWANN.

[69]  Razvan V. Florian,et al.  Reinforcement Learning Through Modulation of Spike-Timing-Dependent Synaptic Plasticity , 2007, Neural Computation.

[70]  R. O’Reilly,et al.  Separate neural substrates for skill learning and performance in the ventral and dorsal striatum , 2007, Nature Neuroscience.

[71]  E. Izhikevich Solving the distal reward problem through linkage of STDP and dopamine signaling , 2007, BMC Neuroscience.

[72]  Marc-Oliver Gewaltig,et al.  NEST (NEural Simulation Tool) , 2007, Scholarpedia.

[73]  Johannes Schemmel,et al.  Spike-Frequency Adapting Neural Ensembles: Beyond Mean Adaptation and Renewal Theories , 2007, Neural Computation.

[74]  Florentin Wörgötter,et al.  Learning with Relevance: Using a Third Factor to Stabilize Hebbian Learning , 2007, Neural Computation.

[75]  B. Richmond,et al.  Knowing without doing , 2007, Nature Neuroscience.

[76]  Ron Meir,et al.  Reinforcement Learning, Spike-Time-Dependent Plasticity, and the BCM Rule , 2007, Neural Computation.

[77]  M. Farries,et al.  Reinforcement learning with modulated spike timing dependent synaptic plasticity. , 2007, Journal of neurophysiology.

[78]  B. Kosko Differential Hebbian learning , 2008 .

[79]  Wulfram Gerstner,et al.  Phenomenological models of synaptic plasticity based on spike timing , 2008, Biological Cybernetics.

[80]  Yoshua Bengio,et al.  Alternative time representation in dopamine models , 2009, Journal of Computational Neuroscience.

[81]  Minija Tamosiunaite,et al.  On the Asymptotic Equivalence Between Differential Hebbian and Temporal Difference Learning , 2008, Neural Computation.

[82]  Hiroyuki Nakahara,et al.  Internal-Time Temporal Difference Model for Neural Value-Based Decision Making , 2010, Neural Computation.

[83]  Markus Diesmann,et al.  Compositionality of arm movements can be realized by propagating synchrony , 2010, Journal of Computational Neuroscience.

[84]  Jean-Marc Fellous,et al.  Computational models of reinforcement learning: the role of dopamine as a reward signal , 2010, Cognitive Neurodynamics.

[85]  Chris Christodoulou,et al.  Multiagent Reinforcement Learning: Spiking and Nonspiking Agents in the Iterated Prisoner's Dilemma , 2011, IEEE Transactions on Neural Networks.