Temporal-Difference Reinforcement Learning with Distributed Representations

Temporal-difference (TD) algorithms have been proposed as models of reinforcement learning (RL). We examine two issues of distributed representation in these TD algorithms: distributed representations of belief and distributed discounting factors. Distributed representation of belief allows the believed state of the world to be distributed across sets of equivalent states. Distributed exponential discounting factors produce hyperbolic discounting in the behavior of the agent itself. We examine these issues in the context of a TD RL model in which state-belief is distributed over a set of exponentially discounting "micro-agents" (µAgents), each of which has a separate discounting factor (γ). Each µAgent maintains an independent hypothesis about the state of the world and a separate value estimate of taking actions within that hypothesized state. The overall agent thus instantiates a flexible representation of an evolving world-state. As with other TD models, the value-error (δ) signal within the model matches dopamine signals recorded from animals in standard conditioning reward paradigms. The distributed representation of belief provides an explanation for the decrease in dopamine at the conditioned stimulus seen in overtrained animals, for the differences between trace and delay conditioning, and for transient bursts of dopamine seen at movement initiation. Because each µAgent also includes its own exponential discounting factor, the overall agent shows hyperbolic discounting, consistent with behavioral experiments.
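The central quantitative claim, that a population of exponentially discounting µAgents with different γ values produces hyperbolic discounting at the level of the overall agent, can be illustrated with a small numerical sketch. The Python code below is not the authors' implementation; it assumes, for illustration, that γ is drawn uniformly on [0, 1) and that each µAgent runs a tabular TD(0) update. Under the uniform assumption the expected value of γ^d is exactly 1/(d + 1), so the population-averaged discount of a reward delayed by d steps follows the hyperbolic form. The helper function also shows the per-µAgent value-error (δ) computation referred to in the abstract.

import numpy as np

# Sketch only (not the paper's code): a population of micro-agents, each
# discounting exponentially with its own gamma, assumed here to be drawn
# uniformly on [0, 1). The population-averaged value of a reward delayed
# by d steps approximates the hyperbolic form 1 / (1 + d).
rng = np.random.default_rng(0)
n_agents = 10_000
gammas = rng.uniform(0.0, 1.0, size=n_agents)   # one discount factor per micro-agent

delays = np.arange(0, 21)
per_agent_discount = gammas[:, None] ** delays[None, :]   # gamma**d for each agent and delay
population_discount = per_agent_discount.mean(axis=0)      # overall agent's effective discount
hyperbolic = 1.0 / (1.0 + delays)

print("delay  population  1/(1+d)")
for d, pop, hyp in zip(delays, population_discount, hyperbolic):
    print(f"{d:5d}  {pop:10.3f}  {hyp:7.3f}")

def td0_update(value, state, next_state, reward, gamma, alpha=0.1):
    """One TD(0) step for a single micro-agent's value table (illustrative only)."""
    delta = reward + gamma * value[next_state] - value[state]   # value-error signal
    value[state] += alpha * delta
    return delta

# Example: one micro-agent's delta when an unexpected reward arrives.
value = np.zeros(3)
delta = td0_update(value, state=0, next_state=1, reward=1.0, gamma=0.9)

Running the first part shows the averaged exponential curve tracking 1/(1 + d) closely, which is the sense in which distributed exponential discounting yields hyperbolic discounting in the aggregate; the γ distribution and learning rate used here are assumptions made for the sketch, not values taken from the paper.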
