Optimizing the depth and the direction of prospective planning using information values

The future consequences of actions can be evaluated by expanding a mental search tree into the future. Expanding deep trees, however, is computationally taxing. Both machines and humans therefore use a plan-until-habit scheme that simulates the environment only up to a limited depth and then substitutes habitual values as proxies for the consequences that lie beyond it. Two questions remain open in this scheme: in which directions should the search tree be expanded, and when should the expansion stop? Here we propose a principled solution to both questions based on a speed/accuracy tradeoff: deeper expansion in the appropriate directions yields more accurate planning, but at the cost of slower decision-making. Our simulation results show that the resulting algorithm expands the search tree effectively and efficiently in a grid-world environment. We further show that the algorithm can account for several behavioral patterns in animals and humans, namely the effect of time pressure on the depth of planning, the effect of reward magnitude on the direction of planning, and the gradual shift from goal-directed to habitual behavior over the course of training. The algorithm also yields several predictions testable in animal and human experiments.
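
To make the plan-until-habit scheme concrete, the sketch below shows a depth-limited lookahead that backs up cached habitual values at the leaves, plus a crude expansion rule driven by the speed/accuracy tradeoff. This is a minimal illustration under stated assumptions, not the paper's exact information-value algorithm: the deterministic environment model `model(s, a)`, the habitual value table `q_habit`, and the value-of-information estimate passed to `should_expand` are all hypothetical placeholders.

```python
from typing import Callable, Dict, List, Tuple

def plan_until_habit(
    state: int,
    depth: int,
    actions: List[int],
    model: Callable[[int, int], Tuple[int, float]],  # hypothetical: (s, a) -> (s', r), deterministic
    q_habit: Dict[Tuple[int, int], float],           # hypothetical cached habitual action values
    gamma: float = 0.95,
) -> float:
    """Depth-limited lookahead that uses habitual values at the leaves."""
    if depth == 0:
        # Planning horizon reached: the cached habitual value stands in
        # for all consequences beyond this point.
        return max(q_habit.get((state, a), 0.0) for a in actions)
    best = float("-inf")
    for a in actions:
        next_state, reward = model(state, a)  # one step of mental simulation
        best = max(best, reward + gamma * plan_until_habit(
            next_state, depth - 1, actions, model, q_habit, gamma))
    return best

def should_expand(voi_estimate: float, deliberation_cost: float) -> bool:
    # Speed/accuracy tradeoff: deepen the tree only while the expected
    # improvement in the decision (value of information) outweighs the
    # cost of the extra thinking time.
    return voi_estimate > deliberation_cost
```

In the full scheme described in the abstract, this expansion decision is made per direction, so the tree deepens only along branches where further simulation is expected to pay off; the uniform-depth recursion above is just the simplest special case.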
