Interplay of approximate planning strategies

Significance

Many problems, particularly sequential planning problems, are computationally very demanding. How humans combine strategies to approximate and simplify these problems is not understood. Using modelling to unpick performance in a planning task, we find that humans are able to exploit the structure of the task to subdivide it and reduce processing requirements nearly optimally. Subtasks are combined in a simple, greedy manner, however, and within subtasks there is evidence of inhibitory reflexes in response to losses.

Abstract

Humans routinely formulate plans in domains so complex that even the most powerful computers are taxed. To do so, they seem to avail themselves of many strategies and heuristics that efficiently simplify, approximate, and hierarchically decompose hard tasks into simpler subtasks. Theoretical and cognitive research has revealed several such strategies; however, little is known about their establishment, interaction, and efficiency. Here, we use model-based behavioral analysis to provide a detailed examination of the performance of human subjects in a moderately deep planning task. We find that subjects exploit the structure of the domain to establish subgoals in a way that achieves a nearly maximal reduction in the cost of computing values of choices, but then combine partial searches with greedy local steps to solve subtasks, and maladaptively prune the decision trees of subtasks in a reflexive manner upon encountering salient losses. Subjects come idiosyncratically to favor particular sequences of actions to achieve subgoals, creating novel complex actions or “options.”
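
The loss-triggered pruning described in the abstract has a simple computational reading, sketched below in Python. Everything in this sketch is a hypothetical stand-in, not the authors' task or model: the transition structure, reward values, search depth, and loss threshold are invented for illustration. The point it demonstrates is general: cutting off every branch that begins with a salient loss shrinks the search tree, but can discard the best overall plan.

```python
# A minimal sketch of reflexive, loss-triggered pruning in depth-limited
# tree search. The maze, rewards, depth, and threshold are hypothetical
# illustrations, not the paradigm from the paper.

# Deterministic world: state -> {action: (next_state, immediate_reward)}
TRANSITIONS = {
    "A": {"left": ("B", -70), "right": ("C", 20)},
    "B": {"left": ("D", 140), "right": ("C", -20)},
    "C": {"left": ("A", -20), "right": ("D", 20)},
    "D": {"left": ("B", -20), "right": ("A", -70)},
}


def plan(state, depth, prune_below=None):
    """Return (best achievable reward sum, nodes expanded).

    With prune_below set, any branch entered through a reward smaller
    than the threshold is discarded without evaluating what lies behind
    it -- the maladaptive, reflexive pruning the abstract describes.
    """
    if depth == 0:
        return 0, 1
    best, nodes = float("-inf"), 1
    for action, (nxt, reward) in TRANSITIONS[state].items():
        if prune_below is not None and reward < prune_below:
            continue  # reflexively cut the subtree behind a salient loss
        value, sub_nodes = plan(nxt, depth - 1, prune_below)
        nodes += sub_nodes
        best = max(best, reward + value)
    if best == float("-inf"):
        best = 0  # every branch was pruned; treat the state as a dead end
    return best, nodes


full_value, full_nodes = plan("A", depth=5)
pruned_value, pruned_nodes = plan("A", depth=5, prune_below=-50)
print(f"full search:   value={full_value}, nodes expanded={full_nodes}")
print(f"pruned search: value={pruned_value}, nodes expanded={pruned_nodes}")
```

In this toy example the pruned search expands 15 nodes instead of 63, but settles on a plan worth 140 rather than 170, because the best trajectory hides behind an initial large loss. This is the trade-off the abstract identifies: pruning buys a real reduction in computation, yet applying it reflexively to salient losses forfeits value.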
