The role of state uncertainty in the dynamics of dopamine

Reinforcement learning models of the basal ganglia map the phasic dopamine signal to reward prediction errors (RPEs). Conventional models assert that, when a stimulus reliably predicts a reward with a fixed delay, dopamine activity during the delay period and at reward time should converge to baseline through learning. However, recent studies have found that, in certain conditions, dopamine ramps up gradually before reward even after extensive learning, for example when animals are trained to run to obtain the reward. This finding challenges the conventional RPE models. In this work, we start from the constraint of temporal uncertainty (animals cannot perfectly estimate the time to reward) and show that sensory feedback, which reduces this uncertainty, causes an unbiased learner to produce RPE ramps. In the absence of feedback, by contrast, RPEs are flat after learning. These results reconcile the seemingly conflicting data on dopamine activity under the RPE hypothesis.
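The conventional prediction mentioned above can be illustrated with a minimal sketch of tabular temporal-difference learning with perfect timing (no state uncertainty). This is not the model developed in this work. The function name `td_learning`, the one-state-per-time-step representation, and all parameter values are assumptions chosen for illustration. Under these assumptions, the RPE at every delay step and at reward time shrinks to zero with training.

```python
import numpy as np

def td_learning(n_steps=10, reward=1.0, gamma=0.9, alpha=0.1, n_trials=2000):
    """Tabular TD(0) on a trial where reward arrives after a fixed delay.

    Hypothetical setup: one state per time step after cue onset, with
    reward delivered on the transition out of the last state.
    """
    # The final entry is the terminal state; its value stays at 0.
    values = np.zeros(n_steps + 1)
    rpes = np.zeros(n_steps)
    for _ in range(n_trials):
        for t in range(n_steps):
            r = reward if t == n_steps - 1 else 0.0
            # RPE: reward plus discounted next value minus current value.
            delta = r + gamma * values[t + 1] - values[t]
            values[t] += alpha * delta
            rpes[t] = delta
    # Return learned values and the RPEs from the last trial.
    return values[:-1], rpes

values, rpes = td_learning()
print(np.round(values, 3))  # values rise toward reward time as gamma ** (steps remaining)
print(np.round(rpes, 6))    # RPEs during the delay and at reward are near zero
```

The sketch omits the transition from the pre-cue baseline into the first state, so it does not show the cue-evoked RPE. It also assumes the learner always knows its exact time step, which is the assumption that temporal uncertainty relaxes.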
