Dopamine Cells Respond to Predicted Events during Classical Conditioning: Evidence for Eligibility Traces in the Reward-Learning Network

Behavioral conditioning of cue-reward pairing results in a shift of midbrain dopamine (DA) cell activity from responding to the reward to responding to the predictive cue. However, the precise time course and mechanism underlying this shift remain unclear. Here, we report a combined single-unit recording and temporal difference (TD) modeling approach to this question. Recordings in conscious rats showed that DA cells retain responses to the predicted reward after responses to conditioned cues have developed, at least early in training. This contrasts with previous TD models, which predict a gradual, stepwise backward shift in response latency, with responses to the reward lost before responses to the conditioned cue develop. By exploring the TD parameter space, we demonstrate that the persistent reward responses of DA cells during conditioning are accurately replicated only by a TD model with long-lasting eligibility traces (nonzero values of the parameter λ) and a low learning rate (α). These physiological constraints on TD parameters suggest that eligibility traces and low per-trial rates of plastic modification may be essential features of neural circuits for reward learning in the brain. Such properties enable rapid but stable initiation of learning when the number of stimulus-reward pairings is limited, conferring significant adaptive advantages in real-world environments.
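The TD(λ) mechanism described above can be illustrated with a minimal simulation. This is a sketch, not the authors' implementation: it assumes a standard complete-serial-compound (tapped-delay-line) stimulus representation, with one feature per time step after cue onset, and illustrative parameter values (cue at step 5, reward at step 15, γ = 0.98). On each step, the TD error δ = r + γV(xₜ₊₁) − V(xₜ) updates the weights through an accumulating eligibility trace e ← γλe + xₜ; with a long-lasting trace (λ near 1) and a low α, a response at the cue develops over trials while the response at the reward declines only gradually, so both coexist early in training, as in the recorded DA cells.

```python
import numpy as np

def run_conditioning(n_trials=200, T=20, t_cue=5, t_rew=15,
                     alpha=0.1, gamma=0.98, lam=0.9):
    """TD(lambda) prediction learning for one cue-reward pairing per trial.

    Stimulus representation: complete serial compound (one-hot feature
    coding time since cue onset). All parameter values are illustrative
    assumptions, not fits to the recorded data.
    Returns the TD error (the model's dopamine-like signal) for every
    time step of every trial.
    """
    w = np.zeros(T)                     # one weight per delay-line feature
    deltas = np.zeros((n_trials, T))
    for trial in range(n_trials):
        e = np.zeros(T)                 # eligibility trace, reset each trial
        for t in range(T - 1):
            x = np.zeros(T)             # current feature vector
            x_next = np.zeros(T)        # next-step feature vector
            if t >= t_cue:
                x[t - t_cue] = 1.0
            if t + 1 >= t_cue:
                x_next[t + 1 - t_cue] = 1.0
            r = 1.0 if t + 1 == t_rew else 0.0   # reward delivered at t_rew
            delta = r + gamma * (w @ x_next) - (w @ x)  # TD prediction error
            e = gamma * lam * e + x              # decay, then mark active feature
            w += alpha * delta * e               # trace-weighted update
            deltas[trial, t] = delta
    return deltas
```

On the first trial the weights are zero, so the only nonzero TD error is the unpredicted reward (δ = 1 at the reward step); over trials, the error at cue onset grows while the error at the reward shrinks. Setting λ = 0 (no trace) or raising α makes the reward response vanish much faster relative to cue acquisition, which is the parameter regime the abstract argues is inconsistent with the recordings.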
