A Local Temporal Difference Code for Distributional Reinforcement Learning

Recent theoretical and experimental results suggest that the dopamine system implements distributional temporal difference backups, allowing learning of the entire distribution of the long-run values of states rather than just their expected values. However, the distributional codes explored so far rely on a complex imputation step that crucially depends on spatial non-locality: in order to compute reward prediction errors, units must know not only their own state but also the states of the other units. It is far from clear how these steps could be implemented in realistic neural circuits. Here, we introduce the Laplace code: a local temporal difference code for distributional reinforcement learning that is representationally powerful and computationally straightforward. The code decomposes value distributions and prediction errors across three separate dimensions: reward magnitude (related to distributional quantiles), temporal discounting (related to the Laplace transform of future rewards) and time horizon (related to eligibility traces). Besides lending itself to a local learning rule, the decomposition recovers the temporal evolution of the immediate reward distribution, indicating all possible rewards at all future times. This increases representational capacity and allows for temporally flexible computations that immediately adjust to changing horizons or discount factors.
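To make the decomposition concrete, the following is a minimal sketch, not the paper's implementation: each unit is assumed to be indexed by a reward threshold and a discount factor and to run an ordinary TD(0) update on the thresholded reward indicator, so its prediction error depends only on its own value estimate. The environment, the particular thresholds and discounts, and the helper td_update are illustrative assumptions.

```python
import numpy as np

# Hypothetical illustration: a grid of units, each indexed by a reward
# threshold h and a discount factor gamma. Each unit runs a purely local
# TD(0) update on the thresholded reward indicator, so no unit ever needs
# to read the state of any other unit.

n_states = 5
thresholds = np.array([0.5, 1.5, 2.5])   # reward magnitudes h (assumed)
gammas = np.array([0.6, 0.8, 0.95])      # discount factors (assumed)
alpha = 0.1                              # learning rate

# V[i, j, s] ~ expected discounted count of future rewards exceeding
# thresholds[i], discounted by gammas[j], starting from state s.
V = np.zeros((len(thresholds), len(gammas), n_states))

def td_update(s, r, s_next):
    """One local distributional TD step for every (threshold, discount) unit."""
    for i, h in enumerate(thresholds):
        for j, g in enumerate(gammas):
            target = float(r > h) + g * V[i, j, s_next]   # unit's own bootstrap
            V[i, j, s] += alpha * (target - V[i, j, s])   # local prediction error

# Toy rollout on a chain: state s -> s + 1, with a stochastic reward (1 or 3)
# delivered only on the final transition.
rng = np.random.default_rng(0)
for episode in range(2000):
    for s in range(n_states - 1):
        r = rng.choice([1.0, 3.0]) if s == n_states - 2 else 0.0
        td_update(s, r, s + 1)

print(V[:, :, 0])   # learned values of the start state across (h, gamma)
```

Read across the discount axis for a fixed threshold, the learned values approximate a discrete Laplace transform of the probability that a future reward exceeds that threshold; a regularized inverse (for example, Tikhonov-regularized least squares) could then recover when rewards of each magnitude are expected, in the spirit of the temporal evolution described above.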
