A Spiking Neural Network Model of an Actor-Critic Learning Agent

The ability to adapt behavior in order to maximize reward through interactions with the environment is crucial for the survival of any higher organism. In the framework of reinforcement learning, temporal-difference learning algorithms provide an effective strategy for such goal-directed adaptation, but it is unclear to what extent these algorithms are compatible with neural computation. In this article, we present a spiking neural network model that implements actor-critic temporal-difference learning by combining local plasticity rules with a global reward signal. The network is capable of solving a nontrivial gridworld task with sparse rewards. We derive a quantitative mapping of plasticity parameters and synaptic weights to the corresponding variables in the standard algorithmic formulation, and we demonstrate that the network learns as quickly as its discrete-time counterpart and attains the same equilibrium performance.
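For orientation, the discrete-time actor-critic algorithm to which the spiking network is mapped can be sketched in a few lines. The Python sketch below is a minimal tabular actor-critic with TD(0) updates on a gridworld; the grid size, goal position, reward value, discount factor, and learning rates are illustrative assumptions and do not reproduce the paper's exact task or parameters.

    import numpy as np

    # Minimal tabular actor-critic with TD(0) on a gridworld.
    # Grid size, goal state, reward, and learning rates below are
    # illustrative assumptions, not the paper's exact setup.
    rng = np.random.default_rng(0)

    SIZE = 5                                       # 5x5 grid (assumed)
    GOAL = (4, 4)                                  # rewarded terminal state (assumed)
    ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

    gamma = 0.9      # discount factor (assumed)
    alpha_v = 0.1    # critic learning rate (assumed)
    alpha_p = 0.1    # actor learning rate (assumed)

    V = np.zeros((SIZE, SIZE))          # critic: state-value estimates
    prefs = np.zeros((SIZE, SIZE, 4))   # actor: action preferences

    def step(state, a):
        """Apply action a; moves into walls leave the state unchanged."""
        r, c = state
        dr, dc = ACTIONS[a]
        return (min(max(r + dr, 0), SIZE - 1),
                min(max(c + dc, 0), SIZE - 1))

    def policy(state):
        """Sample an action from a softmax over the actor's preferences."""
        p = np.exp(prefs[state] - prefs[state].max())
        p /= p.sum()
        return rng.choice(4, p=p)

    for episode in range(500):
        s = (0, 0)
        while s != GOAL:
            a = policy(s)
            s_next = step(s, a)
            reward = 1.0 if s_next == GOAL else 0.0     # sparse reward
            v_next = 0.0 if s_next == GOAL else V[s_next]
            delta = reward + gamma * v_next - V[s]      # TD error
            V[s] += alpha_v * delta                     # critic update
            prefs[s][a] += alpha_p * delta              # actor update
            s = s_next

On each step the TD error delta acts as a single broadcast quantity that simultaneously drives the critic's value update and the actor's preference update, mirroring the abstract's combination of local plasticity rules with one global reward-related signal.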

[1]  Ian H. Witten,et al.  An Adaptive Optimal Controller for Discrete-Time Markov Environments , 1977, Inf. Control..

[2]  A P Georgopoulos,et al.  On the relations between the direction of two-dimensional arm movements and cell discharge in primate motor cortex , 1982, The Journal of neuroscience : the official journal of the Society for Neuroscience.

[3]  Richard S. Sutton,et al.  Neuronlike adaptive elements that can solve difficult learning control problems , 1983, IEEE Transactions on Systems, Man, and Cybernetics.

[4]  A. Harry Klopf,et al.  A drive-reinforcement model of single neuron function , 1987 .

[5]  B. Kosco Differential Hebbian learning , 1987 .

[6]  A. Klopf A neuronal model of classical conditioning , 1988 .

[7]  Daniel J. Amit,et al.  Modeling brain function: the world of attractor neural networks, 1st Edition , 1989 .

[8]  A. Aertsen,et al.  Synaptic plasticity in rat hippocampal slice cultures: local "Hebbian" conjunction of pre- and postsynaptic stimulation leads to distributed synaptic enhancement. , 1989, Proceedings of the National Academy of Sciences of the United States of America.

[9]  J. Bolz,et al.  Non-Hebbian synapses in rat visual cortex. , 1990, Neuroreport.

[10]  P. Dayan The Convergence of TD(λ) for General λ , 2004, Machine Learning.

[11]  R. J. Williams,et al.  Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning , 2004, Machine Learning.

[12]  Joel L. Davis,et al.  A Model of How the Basal Ganglia Generate and Use Neural Signals That Predict Reinforcement , 1994 .

[13]  Gerald Tesauro,et al.  TD-Gammon, a Self-Teaching Backgammon Program, Achieves Master-Level Play , 1994, Neural Computation.

[14]  D. Madison,et al.  Locally distributed synaptic potentiation in the hippocampus. , 1994, Science.

[15]  P. Dayan,et al.  TD(λ) converges with probability 1 , 2004, Machine Learning.

[16]  A. Barto,et al.  Adaptive Critics and the Basal Ganglia , 1994 .

[17]  Joel L. Davis,et al.  Adaptive Critics and the Basal Ganglia , 1995 .

[18]  Peter Dayan,et al.  Bee foraging in uncertain environments using predictive hebbian learning , 1995, Nature.

[19]  P. Dayan,et al.  A framework for mesencephalic dopamine systems based on predictive Hebbian learning , 1996, The Journal of neuroscience : the official journal of the Society for Neuroscience.

[20]  M. Poo,et al.  Spread of Synaptic Depression Mediated by Presynaptic Cytoplasmic Signaling , 1996, Science.

[21]  John N. Tsitsiklis,et al.  Neuro-Dynamic Programming , 1996, Encyclopedia of Machine Learning.

[22]  Peter Dayan,et al.  A Neural Substrate of Prediction and Reward , 1997, Science.

[23]  M. Poo,et al.  Propagation of activity-dependent synaptic depression in simple neural networks , 1997, Nature.

[24]  D. Johnston,et al.  Regulation of Synaptic Efficacy by Coincidence of Postsynaptic APs and EPSPs , 1997 .

[25]  Y. Frégnac,et al.  A phenomenological model of visually evoked spike trains in cat geniculate nonlagged X-cells , 1998, Visual Neuroscience.

[26]  John S. Denker,et al.  Neural Networks for Computing , 1998 .

[27]  Richard S. Sutton,et al.  Introduction to Reinforcement Learning , 1998 .

[28]  Li I. Zhang,et al.  A critical window for cooperation and competition among developing retinotectal synapses , 1998, Nature.

[29]  G. Bi,et al.  Synaptic Modifications in Cultured Hippocampal Neurons: Dependence on Spike Timing, Synaptic Strength, and Postsynaptic Cell Type , 1998, The Journal of Neuroscience.

[30]  W. Schultz,et al.  A neural network model with dopamine-like reinforcement signal that learns a spatial delayed response task , 1999, Neuroscience.

[31]  Kenji Doya,et al.  Reinforcement Learning in Continuous Time and Space , 2000, Neural Computation.

[32]  David J. Foster,et al.  A model of hippocampally dependent navigation, using the temporal difference learning rule , 2000, Hippocampus.

[33]  Li I. Zhang,et al.  Selective Presynaptic Propagation of Long-Term Potentiation in Defined Neural Networks , 2000, The Journal of Neuroscience.

[34]  K. Doya Complementary roles of basal ganglia and cerebellum in learning and motor control , 2000, Current Opinion in Neurobiology.

[35]  R. Kempter,et al.  Temporal map formation in the barn owl's brain. , 2001, Physical review letters.

[36]  Rajesh P. N. Rao,et al.  Spike-Timing-Dependent Hebbian Plasticity as Temporal Difference Learning , 2001, Neural Computation.

[37]  R. Kempter,et al.  Formation of temporal-feature maps by axonal propagation of synaptic learning , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[38]  Jun Morimoto,et al.  Acquisition of stand-up behavior by a real robot using hierarchical reinforcement learning , 2000, Robotics Auton. Syst..

[39]  Roland E. Suri,et al.  Temporal Difference Model Reproduces Anticipatory Neural Activity , 2001, Neural Computation.

[40]  W. Schultz Getting Formal with Dopamine and Reward , 2002, Neuron.

[41]  Y. Niv,et al.  Evolution of Reinforcement Learning in Uncertain Environments: A Simple Explanation for Complex Foraging Behaviors , 2002 .

[42]  Kenji Doya,et al.  Metalearning and neuromodulation , 2002, Neural Networks.

[43]  Eytan Ruppin,et al.  Actor-critic models of the basal ganglia: new anatomical and computational perspectives , 2002, Neural Networks.

[44]  J. Leo van Hemmen,et al.  Mapping time , 2002, Biological Cybernetics.

[45]  John N. J. Reynolds,et al.  Dopamine-dependent plasticity of corticostriatal synapses , 2002, Neural Networks.

[46]  Y. Dan,et al.  Spike-timing-dependent synaptic modification induced by natural spike trains , 2002, Nature.

[47]  Florentin Wörgötter,et al.  Isotropic Sequence Order Learning , 2003, Neural Computation.

[48]  Karl J. Friston,et al.  Temporal Difference Models and Reward-Related Learning in the Human Brain , 2003, Neuron.

[49]  H. Seung,et al.  Learning in Spiking Neural Networks by Reinforcement of Stochastic Synaptic Transmission , 2003, Neuron.

[50]  Vijay R. Konda,et al.  OnActor-Critic Algorithms , 2003, SIAM J. Control. Optim..

[51]  E. Kandel,et al.  Activity-Dependent Presynaptic Facilitation and Hebbian LTP Are Both Required and Interact during Classical Conditioning in Aplysia , 2003, Neuron.

[52]  Xiaohui Xie,et al.  Learning in neural networks by reinforcement of irregular spiking. , 2004, Physical review. E, Statistical, nonlinear, and soft matter physics.

[53]  Karl J. Friston,et al.  Dissociable Roles of Ventral and Dorsal Striatum in Instrumental Conditioning , 2004, Science.

[54]  Patrick D. Roberts,et al.  Computational Consequences of Temporally Asymmetric Learning Rules: I. Differential Hebbian Learning , 1999, Journal of Computational Neuroscience.

[55]  Patrick D. Roberts,et al.  Computational Consequences of Temporally Asymmetric Learning Rules: II. Sensory Image Cancellation , 2000, Journal of Computational Neuroscience.

[56]  M. Delgado,et al.  Modulation of Caudate Activity by Action Contingency , 2004, Neuron.

[57]  Peter Dayan,et al.  Temporal difference models describe higher-order learning in humans , 2004, Nature.

[58]  Daniel Lehmann,et al.  Modeling Compositionality by Dynamic Binding of Synfire Chains , 2004, Journal of Computational Neuroscience.

[59]  S. Thorpe,et al.  Spike times make sense , 2005, Trends in Neurosciences.

[60]  Florentin Wörgötter,et al.  Temporal Sequence Learning, Prediction, and Control: A Review of Different Models and Their Relation to Biological Mechanisms , 2005, Neural Computation.

[61]  Richard S. Sutton,et al.  Reinforcement Learning: An Introduction , 1998, IEEE Trans. Neural Networks.

[62]  Richard S. Sutton,et al.  Learning to predict by the methods of temporal differences , 1988, Machine Learning.

[63]  Rémi Munos,et al.  Policy Gradient in Continuous Time , 2006, J. Mach. Learn. Res..

[64]  W. Gerstner,et al.  Triplets of Spikes in a Model of Spike Timing-Dependent Plasticity , 2006, The Journal of Neuroscience.

[65]  R. Dolan,et al.  Dopamine-dependent prediction errors underpin reward-seeking behaviour in humans , 2006, Nature.

[66]  Markus Diesmann,et al.  Programmable Logic Construction Kits for Hyper-Real-Time Neuronal Modeling , 2006, Neural Computation.

[67]  E. Vaadia,et al.  Midbrain dopamine neurons encode decisions for future action , 2006, Nature Neuroscience.

[68]  Stefan Philipp,et al.  Interconnecting VLSI Spiking Neural Networks Using Isochronous Connections , 2007, IWANN.

[69]  Razvan V. Florian,et al.  Reinforcement Learning Through Modulation of Spike-Timing-Dependent Synaptic Plasticity , 2007, Neural Computation.

[70]  R. O’Reilly,et al.  Separate neural substrates for skill learning and performance in the ventral and dorsal striatum , 2007, Nature Neuroscience.

[71]  E. Izhikevich Solving the distal reward problem through linkage of STDP and dopamine signaling , 2007, BMC Neuroscience.

[72]  Marc-Oliver Gewaltig,et al.  NEST (NEural Simulation Tool) , 2007, Scholarpedia.

[73]  Johannes Schemmel,et al.  Spike-Frequency Adapting Neural Ensembles: Beyond Mean Adaptation and Renewal Theories , 2007, Neural Computation.

[74]  Florentin Wörgötter,et al.  Learning with Relevance: Using a Third Factor to Stabilize Hebbian Learning , 2007, Neural Computation.

[75]  B. Richmond,et al.  Knowing without doing , 2007, Nature Neuroscience.

[76]  Ron Meir,et al.  Reinforcement Learning, Spike-Time-Dependent Plasticity, and the BCM Rule , 2007, Neural Computation.

[77]  M. Farries,et al.  Reinforcement learning with modulated spike timing dependent synaptic plasticity. , 2007, Journal of neurophysiology.

[78]  B. Kosko Differential Hebbian learning , 2008 .

[79]  Wulfram Gerstner,et al.  Phenomenological models of synaptic plasticity based on spike timing , 2008, Biological Cybernetics.

[80]  Yoshua Bengio,et al.  Alternative time representation in dopamine models , 2009, Journal of Computational Neuroscience.

[81]  Minija Tamosiunaite,et al.  On the Asymptotic Equivalence Between Differential Hebbian and Temporal Difference Learning , 2008, Neural Computation.

[82]  Hiroyuki Nakahara,et al.  Internal-Time Temporal Difference Model for Neural Value-Based Decision Making , 2010, Neural Computation.

[83]  Markus Diesmann,et al.  Compositionality of arm movements can be realized by propagating synchrony , 2010, Journal of Computational Neuroscience.

[84]  Jean-Marc Fellous,et al.  Computational models of reinforcement learning: the role of dopamine as a reward signal , 2010, Cognitive Neurodynamics.

[85]  Chris Christodoulou,et al.  Multiagent Reinforcement Learning: Spiking and Nonspiking Agents in the Iterated Prisoner's Dilemma , 2011, IEEE Transactions on Neural Networks.