Spatio-Temporal Credit Assignment in Neuronal Population Learning

In learning from trial and error, animals need to relate behavioral decisions to environmental reinforcement, even though it may be difficult to assign credit to a particular decision when outcomes are uncertain or subject to delays. When considering the biophysical basis of learning, the credit-assignment problem is compounded because the behavioral decisions themselves result from the spatio-temporal aggregation of many synaptic releases. We present a model of plasticity induction for reinforcement learning in a population of leaky integrate-and-fire neurons that is based on a cascade of synaptic memory traces. Each synaptic cascade correlates presynaptic input first with postsynaptic events, next with the behavioral decisions, and finally with external reinforcement. For operant conditioning, learning succeeds even when reinforcement is delivered with a delay so large that temporal contiguity between a decision and its pertinent reward is lost, because intervening decisions occur that are themselves subject to delayed reinforcement. This shows that the model provides a viable mechanism for temporal credit assignment. Further, learning speeds up with increasing population size, so the plasticity cascade simultaneously addresses the spatial problem of assigning credit to synapses in different neurons of the population. Simulations on other tasks, such as sequential decision making, serve to contrast the performance of the proposed scheme with that of temporal-difference-based learning. We argue that, due to their comparative robustness, synaptic plasticity cascades are attractive basic models of reinforcement learning in the brain.
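
The three-stage cascade described above can be illustrated with a minimal sketch for a single synapse. This is a toy illustration and not the model of the paper: the discrete-time leaky traces, the time constants `tau1` and `tau2`, the learning rate `eta`, and all function names are assumptions chosen only to show how a reward that arrives long after the decision can still change the weight.

```python
def spike_train(length, t_spike):
    """Hypothetical helper: a 0/1 sequence with a single event at t_spike."""
    return [1.0 if t == t_spike else 0.0 for t in range(length)]


def cascade_weight_change(pre, post, decision, reward,
                          tau1=20.0, tau2=200.0, eta=0.01, dt=1.0):
    """Toy three-stage trace cascade (assumed form, not the paper's equations).

    All inputs are equal-length sequences sampled every dt time units.
    """
    e1 = 0.0  # stage 1: leaky trace of presynaptic x postsynaptic coincidence
    e2 = 0.0  # stage 2: leaky trace of stage 1 gated by the decision signal
    dw = 0.0  # stage 3: weight change, gated by external reinforcement
    for x, y, d, r in zip(pre, post, decision, reward):
        e1 += dt * (-e1 / tau1) + x * y
        e2 += dt * (-e2 / tau2) + d * e1
        dw += eta * r * e2
    return dw


T = 300
pre = spike_train(T, 10)        # presynaptic input
post = spike_train(T, 10)       # coincident postsynaptic event
decision = spike_train(T, 30)   # behavioral decision shortly afterwards
reward = spike_train(T, 200)    # reinforcement delivered much later

dw_delayed = cascade_weight_change(pre, post, decision, reward)
print("weight change with delayed reward:", dw_delayed)
```

In this sketch the slow second-stage trace bridges the gap between decision and reward; a synapse with no presynaptic input, or a trial with no reward, leaves the weight unchanged.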
