A distributional code for value in dopamine-based reinforcement learning

Since its introduction, the reward prediction error theory of dopamine has explained a wealth of empirical phenomena, providing a unifying framework for understanding the representation of reward and value in the brain (refs. 1–3). According to the now-canonical theory, reward predictions are represented as a single scalar quantity, which supports learning about the expectation, or mean, of stochastic outcomes. Here we propose an account of dopamine-based reinforcement learning inspired by recent artificial intelligence research on distributional reinforcement learning (refs. 4–6). We hypothesized that the brain represents possible future rewards not as a single mean, but instead as a probability distribution, effectively representing multiple future outcomes simultaneously and in parallel. This idea implies a set of empirical predictions, which we tested using single-unit recordings from mouse ventral tegmental area. Our findings provide strong evidence for a neural realization of distributional reinforcement learning.

Analyses of single-cell recordings from mouse ventral tegmental area are consistent with a model of reinforcement learning in which the brain represents possible future rewards not as a single mean of stochastic outcomes, as in the canonical model, but instead as a probability distribution.
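The contrast between a single scalar prediction and a distributional code can be illustrated with a small sketch. It assumes one possible mechanism that the abstract itself does not spell out: each value predictor scales positive and negative prediction errors differently, controlled by a hypothetical asymmetry parameter `tau`. Under that assumption, a predictor with `tau = 0.5` settles near the mean (the classical case), while predictors with other asymmetries settle at more pessimistic or optimistic values (expectiles), so the population as a whole carries information about the shape of the reward distribution. The function names, the two-outcome reward set, and the learning-rate settings are all illustrative choices, not taken from the paper.

```python
def asymmetric_update(value, reward, tau, lr):
    """Apply one prediction-error update with asymmetric scaling.

    tau in (0, 1) is a hypothetical asymmetry parameter: positive errors
    are scaled by tau, negative errors by (1 - tau).
    """
    delta = reward - value                    # reward prediction error
    scale = tau if delta > 0 else 1.0 - tau   # asymmetric scaling (assumption)
    return value + lr * scale * delta


def learn_predictors(outcomes, taus, lr=0.01, sweeps=2000):
    """Train one value predictor per tau by sweeping over equally likely outcomes."""
    values = [0.0] * len(taus)
    for _ in range(sweeps):
        for reward in outcomes:
            values = [
                asymmetric_update(v, reward, tau, lr)
                for v, tau in zip(values, taus)
            ]
    return values


# Two equally likely reward magnitudes (illustrative numbers only).
outcomes = [1.0, 9.0]
taus = [0.1, 0.5, 0.9]

for tau, value in zip(taus, learn_predictors(outcomes, taus)):
    print(f"tau={tau:.1f} -> learned value ~ {value:.2f}")
```

For this two-outcome example the balance point of the update works out to 1 + 8·tau, so the three predictors settle close to 1.8, 5.0 and 8.2. Only the symmetric predictor recovers the mean of 5; the spread of the others reflects the variability of the outcomes.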

Zeb Kurth-Nelson | Clara Kwon Starkweather | Naoshige Uchida | Demis Hassabis | Matthew Botvinick | Will Dabney | Rémi Munos

[1] Marc G. Bellemare, et al. The Arcade Learning Environment: An Evaluation Platform for General Agents, 2012, J. Artif. Intell. Res.

[2] Yael Niv, et al. Opening Burton's Clock: Psychiatric insights from computational cognitive models, 2018.

[3] Tyrone D. Cannon, et al. Striatal dopamine D1 and D2 receptor balance in twins at increased genetic risk for schizophrenia, 2006, Psychiatry Research: Neuroimaging.

[4] Naoshige Uchida, et al. Arithmetic and local circuitry underlying dopamine prediction errors, 2015, Nature.

[5] Rafal Bogacz, et al. Learning Reward Uncertainty in the Basal Ganglia, 2016, PLoS Comput. Biol.

[6] P. Dayan, et al. A computational and neural model of momentary subjective well-being, 2014, Proceedings of the National Academy of Sciences.

[7] Shane Legg, et al. Human-level control through deep reinforcement learning, 2015, Nature.

[8] N. Uchida, et al. Neural Circuitry of Reward Prediction Error, 2017, Annual Review of Neuroscience.

[9] A. Pouget, et al. Probabilistic brains: knowns and unknowns, 2013, Nature Neuroscience.

[10] Demis Hassabis, et al. Mastering the game of Go with deep neural networks and tree search, 2016, Nature.

[11] Masashi Sugiyama, et al. Parametric Return Density Estimation for Reinforcement Learning, 2010, UAI.

[12] David Silver, et al. Deep Reinforcement Learning with Double Q-Learning, 2015, AAAI.

[13] Anne E Carpenter, et al. Neuron-type specific signals for reward and punishment in the ventral tegmental area, 2011, Nature.

[14] Rémi Munos, et al. Implicit Quantile Networks for Distributional Reinforcement Learning, 2018, ICML.

[15] B. Hoffer, et al. Characterization of a mouse strain expressing Cre recombinase from the 3′ untranslated region of the dopamine transporter locus, 2006, Genesis.

[16] Naoshige Uchida, et al. Habenula Lesions Reveal that Multiple Mechanisms Underlie Dopamine Prediction Errors, 2015, Neuron.

[17] Marc G. Bellemare, et al. Statistics and Samples in Distributional Reinforcement Learning, 2019, ICML.

[18] N. Uchida, et al. Dopamine neurons share common response function for reward prediction error, 2016, Nature Neuroscience.

[19] S. Lammel, et al. Reward and aversion in a heterogeneous midbrain dopamine system, 2014, Neuropharmacology.

[20] Johanna F. Ziegel, et al. Coherence and Elicitability, 2013, arXiv:1303.1690.

[21] Minryung R. Song, et al. Multiphasic Temporal Dynamics in Responses of Midbrain Dopamine Neurons to Appetitive and Aversive Stimuli, 2013, The Journal of Neuroscience.

[22] Marc G. Bellemare, et al. Distributional Reinforcement Learning with Quantile Regression, 2017, AAAI.

[23] Matthew W. Hoffman, et al. Distributed Distributional Deterministic Policy Gradients, 2018, ICLR.

[24] W. Schultz, et al. The phasic dopamine signal maturing: from reward via behavioural activation to formal economic utility, 2017, Current Opinion in Neurobiology.

[25] W. Newey, et al. Asymmetric Least Squares Estimation and Testing, 1987.

[26] Tom Schaul, et al. Prioritized Experience Replay, 2015, ICLR.

[27] P. Glimcher. Understanding dopamine and reinforcement learning: The dopamine reward prediction error hypothesis, 2011, Proceedings of the National Academy of Sciences.

[28] Pedro Rosa-Neto, et al. Gradients of dopamine D1- and D2/3-binding sites in the basal ganglia of pig and monkey measured by PET, 2004, NeuroImage.

[29] M. C. Jones. Expectiles and M-quantiles are quantiles, 1994.

[30] Marc G. Bellemare, et al. A Distributional Perspective on Reinforcement Learning, 2017, ICML.

[31] William R. Stauffer, et al. Dopamine Reward Prediction Error Responses Reflect Marginal Utility, 2014, Current Biology.

[32] M. Botvinick, et al. Hierarchically organized behavior and its neural foundations: A reinforcement learning perspective, 2009, Cognition.

[33] Xiao-Jing Wang, et al. Reward-based training of recurrent neural networks for cognitive and value-based tasks, 2016, bioRxiv.

[34] Joel Z. Leibo, et al. Prefrontal cortex as a meta-reinforcement learning system, 2018, bioRxiv.

[35] E. Perry, et al. Dopaminergic activities in the human striatum: rostrocaudal gradients of uptake sites and of D1 and D2 but not of D3 receptor binding or dopamine, 1999, Neuroscience.

[36] Michael J. Frank, et al. By Carrot or by Stick: Cognitive Reinforcement Learning in Parkinsonism, 2004, Science.

[37] Tom Schaul, et al. Rainbow: Combining Improvements in Deep Reinforcement Learning, 2017, AAAI.

[38] P. Dayan, et al. Depression: a decision-theoretic analysis, 2015, Annual Review of Neuroscience.

[39] W. Schultz, et al. Discrete Coding of Reward Probability and Uncertainty by Dopamine Neurons, 2003, Science.