Reward-Based Learning, Model-Based and Model-Free

Quentin J. M. Huys*, Anthony Cruickshank and Peggy Series Translational Neuromodeling Unit, Institute of Biomedical Engineering, ETH Z€urich and University of Z€urich, Z€urich, Switzerland Department of Psychiatry, Psychotherapy and Psychosomatics, Hospital of Psychiatry, University of Z€urich, Z€urich, Switzerland Institute of Adaptive and Neural Computation, University of Edinburgh, Edinburgh, UK

[1]  M. Roesch,et al.  Dopamine neurons encode the better option in rats deciding between differently delayed or sized rewards , 2007, Nature Neuroscience.

[2]  B. Balleine,et al.  Lesions of dorsolateral striatum preserve outcome expectancy but disrupt habit formation in instrumental learning , 2004, The European journal of neuroscience.

[3]  Martin L. Puterman,et al.  Markov Decision Processes: Discrete Stochastic Dynamic Programming , 1994 .

[4]  Thomas G. Dietterich Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition , 1999, J. Artif. Intell. Res..

[5]  K. Berridge,et al.  Instant Transformation of Learned Repulsion into Motivational “Wanting” , 2013, Current Biology.

[6]  T. Robbins,et al.  Neurocognitive endophenotypes of impulsivity and compulsivity: towards dimensional psychiatry , 2012, Trends in Cognitive Sciences.

[7]  Peter Dayan,et al.  Bonsai Trees in Your Head: How the Pavlovian System Sculpts Goal-Directed Choices by Pruning Decision Trees , 2012, PLoS Comput. Biol..

[8]  Donald E. Knuth,et al.  The Solution for the Branching Factor of the Alpha-Beta Pruning Algorithm , 1981, ICALP.

[9]  Doina Precup,et al.  Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning , 1999, Artif. Intell..

[10]  P. Dayan,et al.  Model-based and model-free Pavlovian reward learning: Revaluation, revision, and revelation , 2014, Cognitive, affective & behavioral neuroscience.

[11]  P. Dayan,et al.  Behavioral/systems/cognitive Action Dominates Valence in Anticipatory Representations in the Human Striatum and Dopaminergic Midbrain , 2010 .

[12]  Josiah R. Boivin,et al.  A Causal Link Between Prediction Errors, Dopamine Neurons and Learning , 2013, Nature Neuroscience.

[13]  L. Kamin Predictability, surprise, attention, and conditioning , 1967 .

[14]  W. Schultz,et al.  Dopamine responses comply with basic assumptions of formal learning theory , 2001, Nature.

[15]  W. Schultz,et al.  Dopamine neurons of the monkey midbrain: contingencies of responses to stimuli eliciting immediate behavioral reactions. , 1990, Journal of neurophysiology.

[16]  Leslie Pack Kaelbling,et al.  Planning and Acting in Partially Observable Stochastic Domains , 1998, Artif. Intell..

[17]  Craig Boutilier,et al.  Exploiting Structure in Policy Construction , 1995, IJCAI.

[18]  Csaba Szepesvári,et al.  Bandit Based Monte-Carlo Planning , 2006, ECML.

[19]  Richard S. Sutton,et al.  Neuronlike adaptive elements that can solve difficult learning control problems , 1983, IEEE Transactions on Systems, Man, and Cybernetics.

[20]  Raymond J. Dolan,et al.  Disentangling the Roles of Approach, Activation and Valence in Instrumental and Pavlovian Responding , 2011, PLoS Comput. Biol..

[21]  Peter Dayan,et al.  Improving Generalization for Temporal Difference Learning: The Successor Representation , 1993, Neural Computation.

[22]  Michael Kearns,et al.  Near-Optimal Reinforcement Learning in Polynomial Time , 2002, Machine Learning.

[23]  B. Balleine,et al.  Role of cholecystokinin in the motivational control of instrumental action in rats. , 1994, Behavioral neuroscience.

[24]  T. Robbins,et al.  Effects of selective excitotoxic lesions of the nucleus accumbens core, anterior cingulate cortex, and central nucleus of the amygdala on autoshaping performance in rats. , 2002, Behavioral neuroscience.

[25]  Peter Dayan,et al.  A Neural Substrate of Prediction and Reward , 1997, Science.

[26]  John N. Tsitsiklis,et al.  Neuro-Dynamic Programming , 1996, Encyclopedia of Machine Learning.

[27]  M. Frank,et al.  From reinforcement learning models to psychiatric and neurological disorders , 2011, Nature Neuroscience.

[28]  P. Dayan,et al.  Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control , 2005, Nature Neuroscience.

[29]  J. O'Doherty,et al.  The Role of the Ventromedial Prefrontal Cortex in Abstract State-Based Inference during Decision Making in Humans , 2006, The Journal of Neuroscience.

[30]  L. Deserno,et al.  Model-Based and Model-Free Decisions in Alcohol Dependence , 2014, Neuropsychobiology.

[31]  Kyle S. Smith,et al.  A Dual Operator View of Habitual Behavior Reflecting Cortical and Striatal Dynamics , 2013, Neuron.

[32]  P. Glimcher,et al.  Midbrain Dopamine Neurons Encode a Quantitative Reward Prediction Error Signal , 2005, Neuron.

[33]  P. Tobler,et al.  The role of learning-related dopamine signals in addiction vulnerability. , 2014, Progress in brain research.

[34]  Murray Campbell,et al.  Deep Blue , 2002, Artif. Intell..

[35]  Michael J. Frank,et al.  By Carrot or by Stick: Cognitive Reinforcement Learning in Parkinsonism , 2004, Science.

[36]  T. Robinson,et al.  A selective role for dopamine in reward learning , 2010, Nature.

[37]  M. Kahana,et al.  Human Substantia Nigra Neurons Encode Unexpected Financial Rewards , 2009, Science.

[38]  N. Daw,et al.  Characterizing a psychiatric symptom dimension related to deficits in goal-directed control , 2016, eLife.

[39]  P. Dayan,et al.  Model-based influences on humans’ choices and striatal prediction errors , 2011, Neuron.

[40]  P. Dayan,et al.  States versus Rewards: Dissociable Neural Prediction Error Signals Underlying Model-Based and Model-Free Reinforcement Learning , 2010, Neuron.

[41]  M. Roesch,et al.  A new perspective on the role of the orbitofrontal cortex in adaptive behaviour , 2009, Nature Reviews Neuroscience.

[42]  Samuel M. McClure,et al.  BOLD Responses Reflecting Dopaminergic Signals in the Human Ventral Tegmental Area , 2008, Science.

[43]  B. Balleine,et al.  The role of the dorsomedial striatum in instrumental conditioning , 2005, The European journal of neuroscience.

[44]  W. Schultz,et al.  Adaptive Coding of Reward Value by Dopamine Neurons , 2005, Science.

[45]  Richard S. Sutton,et al.  Integrated Architectures for Learning, Planning, and Reacting Based on Approximating Dynamic Programming , 1990, ML.

[46]  M. Botvinick,et al.  The successor representation in human reinforcement learning , 2016, bioRxiv.

[47]  M. Bouton Learning and Behavior: A Contemporary Synthesis , 2006 .

[48]  B. Campbell,et al.  Punishment and aversive behavior , 1969 .

[49]  Further Particulars GATSBY COMPUTATIONAL NEUROSCIENCE UNIT , 2003 .

[50]  S. Killcross,et al.  Coordination of actions and habits in the medial prefrontal cortex of rats. , 2003, Cerebral cortex.

[51]  E. Vaadia,et al.  Midbrain dopamine neurons encode decisions for future action , 2006, Nature Neuroscience.

[52]  R. Wightman,et al.  Associative learning mediates dynamic shifts in dopamine signaling in the nucleus accumbens , 2007, Nature Neuroscience.

[53]  P. Dayan,et al.  A framework for mesencephalic dopamine systems based on predictive Hebbian learning , 1996, The Journal of neuroscience : the official journal of the Society for Neuroscience.

[54]  Peter Dayan,et al.  Q-learning , 1992, Machine Learning.

[55]  B. Skinner,et al.  Principles of Behavior , 1944 .

[56]  X. Zhuang,et al.  Faculty Opinions recommendation of A selective role for dopamine in stimulus-reward learning. , 2010 .

[57]  Amir Dezfouli,et al.  Speed/Accuracy Trade-Off between the Habitual and the Goal-Directed Processes , 2011, PLoS Comput. Biol..

[58]  T. Robbins,et al.  Enhanced Avoidance Habits in Obsessive-Compulsive Disorder , 2014, Biological Psychiatry.

[59]  T. Robbins,et al.  Disruption in the Balance Between Goal-Directed Behavior and Habit Learning in Obsessive-Compulsive Disorder , 2011, The American journal of psychiatry.

[60]  Peter Dayan,et al.  Non-commercial Research and Educational Use including without Limitation Use in Instruction at Your Institution, Sending It to Specific Colleagues That You Know, and Providing a Copy to Your Institution's Administrator. All Other Uses, Reproduction and Distribution, including without Limitation Comm , 2022 .

[61]  Brad E. Pfeiffer,et al.  Hippocampal place cell sequences depict future paths to remembered goals , 2013, Nature.

[62]  Y. Niv,et al.  Ventral Striatum and Orbitofrontal Cortex Are Both Required for Model-Based, But Not Model-Free, Reinforcement Learning , 2011, The Journal of Neuroscience.

[63]  Richard S. Sutton,et al.  Reinforcement Learning: An Introduction , 1998, IEEE Trans. Neural Networks.

[64]  E. Tolman Cognitive maps in rats and men. , 1948, Psychological review.

[65]  P. Dayan,et al.  Disorders of compulsivity: a common bias towards learning habits , 2014, Molecular Psychiatry.

[66]  Alex M. Andrew,et al.  ROBOT LEARNING, edited by Jonathan H. Connell and Sridhar Mahadevan, Kluwer, Boston, 1993/1997, xii+240 pp., ISBN 0-7923-9365-1 (Hardback, 218.00 Guilders, $120.00, £89.95). , 1999, Robotica (Cambridge. Print).

[67]  B. Balleine,et al.  Double Dissociation of Basolateral and Central Amygdala Lesions on the General and Outcome-Specific Forms of Pavlovian-Instrumental Transfer , 2005, The Journal of Neuroscience.

[68]  榎本 一紀 Dopamine neurons learn to encode the long-term value of multiple future rewards , 2011 .

[69]  S. Killcross,et al.  Amphetamine Exposure Enhances Habit Formation , 2006, The Journal of Neuroscience.

[70]  Samuel M. McClure,et al.  A computational substrate for incentive salience , 2003, Trends in Neurosciences.

[71]  R. Dolan,et al.  Dopamine Enhances Model-Based over Model-Free Choice Behavior , 2012, Neuron.

[72]  B. Balleine,et al.  The General and Outcome-Specific Forms of Pavlovian-Instrumental Transfer Are Differentially Mediated by the Nucleus Accumbens Core and Shell , 2011, The Journal of Neuroscience.

[73]  Adam Johnson,et al.  Computing motivation: Incentive salience boosts of drug or appetite states , 2008, Behavioral and Brain Sciences.

[74]  Vivian V. Valentin,et al.  Determining the Neural Substrates of Goal-Directed Learning in the Human Brain , 2007, The Journal of Neuroscience.

[75]  Adam Johnson,et al.  Neural Ensembles in CA3 Transiently Encode Paths Forward of the Animal at a Decision Point , 2007, The Journal of Neuroscience.

[76]  J. Mirenowicz,et al.  Dissociation of Pavlovian and instrumental incentive learning under dopamine antagonists. , 2000, Behavioral neuroscience.

[77]  Sean R Eddy,et al.  What is dynamic programming? , 2004, Nature Biotechnology.

[78]  P. Glimcher,et al.  Statistics of midbrain dopamine neuron spike trains in the awake primate. , 2007, Journal of neurophysiology.