Simple Plans or Sophisticated Habits? State, Transition and Learning Interactions in the Two-step Task

The recently developed ‘two-step’ behavioural task promises to differentiate model-based or goal-directed from model-free or habitual reinforcement learning, while generating neurophysiologically-friendly decision datasets with parametric variation of decision variables. These desirable features have prompted widespread adoption of the task. However, the signatures of model-based control can be elusive. Here, we investigate model-free learning methods that, depending on the analysis strategy, can masquerade as being model-based. We first show that unadorned model-free reinforcement learning can induce correlations between action values at the start of the trial and the subsequent trial events in such a way that analysis based on comparing successive trials can lead to erroneous conclusions. We also suggest a correction to the analysis that can alleviate this problem. We then consider model-free reinforcement learning strategies based on different state representations from those envisioned by the experimenter, which generate behaviour that appears model-based under these, and also more sophisticated, analyses. The existence of such strategies is of particular relevance to the design and interpretation of animal studies using the two-step task, as extended training and a sharp contrast between good and bad options are likely to promote their use.

Author Summary

Planning is the use of a predictive model of the consequences of actions to guide decision making. Planning plays a critical role in human behaviour but isolating its contribution is challenging because it is complemented by control systems which learn values of actions directly from the history of reinforcement, resulting in automatized mappings from states to actions often termed habits. Our study examined a recently developed behavioural task which uses choices in a multi-step decision tree to differentiate planning from value-based control. Using simulation, we demonstrated the existence of strategies which produce behaviour that resembles planning but in fact arises as a fixed mapping from particular sorts of states to actions. These results show that when a planning problem is faced repeatedly, sophisticated automatization strategies may be developed which identify that there are in fact a limited number of relevant states of the world each with an appropriate fixed or habitual response. Understanding such strategies is important for the design and interpretation of tasks which aim to isolate the contribution of planning to behaviour. Such strategies are also of independent scientific interest as they may contribute to automatization of behaviour in complex environments.
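
As a rough illustration of the kind of successive-trial analysis discussed above, the sketch below simulates a purely model-free learner on a simplified two-step task and tabulates how often it repeats its first-step choice, split by whether the previous trial's transition was common or rare and whether it was rewarded. Everything specific here is an assumption made for illustration and is not taken from the study: the function names `simulate` and `stay_probabilities`, the learning rate, the softmax inverse temperature, the 0.8 common-transition probability, the 0.8/0.2 reward probabilities, the fixed 50-trial reversal blocks, and the reduction of the second step to a single outcome per state.

```python
import math
import random


def simulate(n_trials=20000, alpha=0.5, beta=5.0, p_common=0.8,
             block_len=50, seed=0):
    """Simulate a model-free agent; returns (choice, common, reward) per trial.

    All parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    q = [0.0, 0.0]            # model-free values of the two first-step actions
    reward_prob = [0.8, 0.2]  # reward probability in each second-step state
    trials = []
    for t in range(n_trials):
        # Assumed block structure: the good and bad states swap periodically.
        if t > 0 and t % block_len == 0:
            reward_prob.reverse()
        # Softmax choice between the two first-step actions.
        p_choose_0 = 1.0 / (1.0 + math.exp(-beta * (q[0] - q[1])))
        choice = 0 if rng.random() < p_choose_0 else 1
        # Common transitions lead to the state with the same index as the action.
        common = rng.random() < p_common
        state = choice if common else 1 - choice
        reward = 1 if rng.random() < reward_prob[state] else 0
        # Model-free update: the chosen action is credited with the outcome
        # directly, with no use of the transition structure.
        q[choice] += alpha * (reward - q[choice])
        trials.append((choice, common, reward))
    return trials


def stay_probabilities(trials):
    """P(repeat first-step choice), keyed by the previous (common, reward)."""
    counts = {}
    for prev, nxt in zip(trials, trials[1:]):
        key = (prev[1], prev[2])
        stays, total = counts.get(key, (0, 0))
        counts[key] = (stays + int(nxt[0] == prev[0]), total + 1)
    return {key: stays / total for key, (stays, total) in counts.items()}


if __name__ == "__main__":
    for (common, reward), p in sorted(stay_probabilities(simulate()).items()):
        label = "common" if common else "rare"
        print(f"{label:6s} reward={reward}  P(stay)={p:.3f}")
```

Comparing the four stay probabilities is the simplest form of the analysis. The sketch does not implement the correction suggested in the abstract, nor the alternative state representations considered there.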
