Reinforcement Learning and Episodic Memory in Humans and Animals: An Integrative Framework

We review the psychology and neuroscience of reinforcement learning (RL), which has experienced significant progress in the past two decades, enabled by the comprehensive experimental study of simple learning and decision-making tasks. However, one challenge in the study of RL is computational: the simplicity of these tasks ignores important aspects of reinforcement learning in the real world. (a) State spaces are high-dimensional, continuous, and partially observable; this implies that (b) data are relatively sparse and, indeed, precisely the same situation may never be encountered twice; furthermore, (c) rewards depend on the long-term consequences of actions in ways that violate the classical assumptions that make RL tractable. A seemingly distinct challenge is that, cognitively, theories of RL have largely involved procedural and semantic memory: the way in which knowledge about action values or world models, extracted gradually from many experiences, can drive choice. This focus on semantic memory leaves out many other aspects of memory, such as episodic memory, related to the traces of individual events. We suggest that these two challenges are related. The computational challenge can be dealt with, in part, by endowing RL systems with episodic memory, allowing them to (a) efficiently approximate value functions over complex state spaces, (b) learn with very little data, and (c) bridge long-term dependencies between actions and rewards. We review the computational theory underlying this proposal and the empirical evidence supporting it. Our proposal suggests that the ubiquitous and diverse roles of memory in RL may function as part of an integrated learning system.
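To make the computational proposal concrete, the sketch below (ours, not taken from the review) illustrates one way episodic memory can support value estimation: the agent stores the trace of each individual episode and values a novel state by a similarity-weighted average of the returns stored with similar past episodes, in the spirit of kernel-based RL and episodic control. The class name EpisodicValueStore, the Gaussian kernel, and the bandwidth parameter are illustrative assumptions, not a specification from the paper.

```python
import numpy as np

class EpisodicValueStore:
    """Minimal episodic value estimator (illustrative sketch).

    Stores individual (state, action, return) traces and values new
    state-action pairs by kernel-weighted averaging over stored episodes,
    in the spirit of kernel-based RL / episodic control.
    """

    def __init__(self, bandwidth=1.0):
        self.bandwidth = bandwidth  # kernel width; an assumed free parameter
        self.traces = {}            # action -> list of (state, return) traces

    def record(self, state, action, episodic_return):
        # Keep the trace of the single event itself, rather than folding it
        # into an incrementally updated (semantic) value estimate.
        self.traces.setdefault(action, []).append(
            (np.asarray(state, dtype=float), float(episodic_return))
        )

    def value(self, state, action):
        # Gaussian-kernel similarity: nearby stored episodes contribute more.
        # Even a single similar episode supports generalization to a state
        # never encountered before, addressing the sparse-data problem.
        traces = self.traces.get(action)
        if not traces:
            return 0.0
        states = np.stack([s for s, _ in traces])
        returns = np.array([r for _, r in traces])
        sq_dists = np.sum((states - np.asarray(state, dtype=float)) ** 2, axis=1)
        weights = np.exp(-sq_dists / (2.0 * self.bandwidth ** 2))
        return float(weights @ returns / (weights.sum() + 1e-12))

    def act(self, state, actions):
        # Greedy choice over episodically estimated values.
        return max(actions, key=lambda a: self.value(state, a))
```

Because each stored trace carries the full observed return of its episode, the delayed consequences of an action are bridged by retrieval of the trace rather than by many incremental bootstrapped updates, which is one reading of point (c) above.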
