A distributional code for value in dopamine-based reinforcement learning

Since its introduction, the reward prediction error theory of dopamine has explained a wealth of empirical phenomena, providing a unifying framework for understanding the representation of reward and value in the brain1–3. According to the now canonical theory, reward predictions are represented as a single scalar quantity, which supports learning about the expectation, or mean, of stochastic outcomes. Here we propose an account of dopamine-based reinforcement learning inspired by recent artificial intelligence research on distributional reinforcement learning4–6. We hypothesized that the brain represents possible future rewards not as a single mean, but instead as a probability distribution, effectively representing multiple future outcomes simultaneously and in parallel. This idea implies a set of empirical predictions, which we tested using single-unit recordings from the mouse ventral tegmental area. Our findings provide strong evidence for a neural realization of distributional reinforcement learning.
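To make the hypothesized mechanism concrete, the sketch below (Python/NumPy; not the authors' code or data) illustrates one simple form of distributional value learning from this literature: a population of value predictors, each applying asymmetric learning rates to positive versus negative prediction errors, converges to different expectiles of a stochastic reward distribution rather than to its single mean. All parameter values here (the asymmetry parameters, the learning rate, the two-outcome reward) are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of expectile-style distributional learning, assuming a
# simple stationary reward distribution (not the paper's task or analysis).

rng = np.random.default_rng(0)

n_units = 7
# Hypothetical asymmetry parameters tau in (0, 1); tau = 0.5 recovers a
# classical mean-tracking predictor.
taus = np.linspace(0.1, 0.9, n_units)
alpha = 0.02                  # base learning rate (illustrative value)
values = np.zeros(n_units)    # each unit's learned reward prediction

def sample_reward():
    # Bimodal stochastic outcome standing in for a variable-reward cue.
    return 1.0 if rng.random() < 0.3 else 5.0

for _ in range(20000):
    r = sample_reward()
    delta = r - values                          # per-unit prediction errors
    # Positive errors are scaled by tau, negative errors by (1 - tau), so
    # high-tau ("optimistic") units weight positive surprises more strongly.
    scale = np.where(delta > 0, taus, 1.0 - taus)
    values += alpha * scale * delta

print("per-unit predictions:", np.round(values, 2))
print("mean reward:", 0.3 * 1.0 + 0.7 * 5.0)
```

With tau = 0.5 the update reduces to standard mean-tracking learning; values of tau above or below 0.5 yield optimistic or pessimistic predictors. A diverse population of such predictors encodes the shape of the reward distribution, which is the kind of diversity the distributional account attributes to individual dopamine neurons.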

[1] Marc G. Bellemare, et al. The Arcade Learning Environment: An Evaluation Platform for General Agents (Extended Abstract), 2012, IJCAI.

[2] Yael Niv, et al. Opening Burton's Clock: Psychiatric insights from computational cognitive models, 2018.

[3] Tyrone D. Cannon, et al. Striatal dopamine D1 and D2 receptor balance in twins at increased genetic risk for schizophrenia, 2006, Psychiatry Research: Neuroimaging.

[4] Naoshige Uchida, et al. Arithmetic and local circuitry underlying dopamine prediction errors, 2015, Nature.

[5] Rafal Bogacz, et al. Learning Reward Uncertainty in the Basal Ganglia, 2016, PLoS Comput. Biol.

[6] P. Dayan, et al. A computational and neural model of momentary subjective well-being, 2014, Proceedings of the National Academy of Sciences.

[7] Shane Legg, et al. Human-level control through deep reinforcement learning, 2015, Nature.

[8] N. Uchida, et al. Neural Circuitry of Reward Prediction Error, 2017, Annual Review of Neuroscience.

[9] A. Pouget, et al. Probabilistic brains: knowns and unknowns, 2013, Nature Neuroscience.

[10] Demis Hassabis, et al. Mastering the game of Go with deep neural networks and tree search, 2016, Nature.

[11] Masashi Sugiyama, et al. Parametric Return Density Estimation for Reinforcement Learning, 2010, UAI.

[12] David Silver, et al. Deep Reinforcement Learning with Double Q-Learning, 2015, AAAI.

[13] Anne E. Carpenter, et al. Neuron-type specific signals for reward and punishment in the ventral tegmental area, 2011, Nature.

[14] Rémi Munos, et al. Implicit Quantile Networks for Distributional Reinforcement Learning, 2018, ICML.

[15] B. Hoffer, et al. Characterization of a mouse strain expressing Cre recombinase from the 3′ untranslated region of the dopamine transporter locus, 2006, Genesis.

[16] Naoshige Uchida, et al. Habenula Lesions Reveal that Multiple Mechanisms Underlie Dopamine Prediction Errors, 2015, Neuron.

[17] Marc G. Bellemare, et al. Statistics and Samples in Distributional Reinforcement Learning, 2019, ICML.

[18] N. Uchida, et al. Dopamine neurons share common response function for reward prediction error, 2016, Nature Neuroscience.

[19] S. Lammel, et al. Reward and aversion in a heterogeneous midbrain dopamine system, 2014, Neuropharmacology.

[20] Johanna F. Ziegel, et al. Coherence and Elicitability, 2013, arXiv:1303.1690.

[21] Minryung R. Song, et al. Multiphasic Temporal Dynamics in Responses of Midbrain Dopamine Neurons to Appetitive and Aversive Stimuli, 2013, The Journal of Neuroscience.

[22] Marc G. Bellemare, et al. Distributional Reinforcement Learning with Quantile Regression, 2017, AAAI.

[23] Matthew W. Hoffman, et al. Distributed Distributional Deterministic Policy Gradients, 2018, ICLR.

[24] W. Schultz, et al. The phasic dopamine signal maturing: from reward via behavioural activation to formal economic utility, 2017, Current Opinion in Neurobiology.

[25] W. Newey, et al. Asymmetric Least Squares Estimation and Testing, 1987.

[26] Tom Schaul, et al. Prioritized Experience Replay, 2015, ICLR.

[27] P. Glimcher. Understanding dopamine and reinforcement learning: The dopamine reward prediction error hypothesis, 2011, Proceedings of the National Academy of Sciences.

[28] Pedro Rosa-Neto, et al. Gradients of dopamine D1- and D2/3-binding sites in the basal ganglia of pig and monkey measured by PET, 2004, NeuroImage.

[29] M. C. Jones. Expectiles and M-quantiles are quantiles, 1994.

[30] Marc G. Bellemare, et al. A Distributional Perspective on Reinforcement Learning, 2017, ICML.

[31] William R. Stauffer, et al. Dopamine Reward Prediction Error Responses Reflect Marginal Utility, 2014, Current Biology.

[32] M. Botvinick, et al. Hierarchically organized behavior and its neural foundations: A reinforcement learning perspective, 2009, Cognition.

[33] Xiao-Jing Wang, et al. Reward-based training of recurrent neural networks for cognitive and value-based tasks, 2016, bioRxiv.

[34] Joel Z. Leibo, et al. Prefrontal cortex as a meta-reinforcement learning system, 2018, bioRxiv.

[35] E. Perry, et al. Dopaminergic activities in the human striatum: rostrocaudal gradients of uptake sites and of D1 and D2 but not of D3 receptor binding or dopamine, 1999, Neuroscience.

[36] Michael J. Frank, et al. By Carrot or by Stick: Cognitive Reinforcement Learning in Parkinsonism, 2004, Science.

[37] Tom Schaul, et al. Rainbow: Combining Improvements in Deep Reinforcement Learning, 2017, AAAI.

[38] P. Dayan, et al. Depression: a decision-theoretic analysis, 2015, Annual Review of Neuroscience.

[39] W. Schultz, et al. Discrete Coding of Reward Probability and Uncertainty by Dopamine Neurons, 2003, Science.