A neural model of hierarchical reinforcement learning

We develop a novel, biologically detailed neural model of reinforcement learning (RL) processes in the brain. The model incorporates a broad range of biological features that pose challenges for neural RL, such as temporally extended action sequences, continuous environments with unknown time delays, and noisy/imprecise computations. Most significantly, we extend the model into the realm of hierarchical reinforcement learning (HRL), which divides the RL process into a hierarchy of actions at different levels of abstraction. We implement all the major components of HRL in a neural model that captures a variety of known anatomical and physiological properties of the brain. We demonstrate the model's performance in a range of different environments, emphasizing that the aim is to understand the brain's general reinforcement learning ability rather than performance on any single task. The results show that the model compares well to previous modelling work and that its hierarchical abilities yield improved performance. We also show that the model's behaviour is consistent with available data on human hierarchical RL, and we generate several novel predictions.
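To make concrete what "a hierarchy of actions at different levels of abstraction" means algorithmically, here is a minimal, non-spiking sketch of the standard options/SMDP formulation of HRL: a top-level learner chooses among temporally extended sub-policies (options) rather than primitive actions. This is an illustration of the general technique, not the paper's spiking neural implementation; the Corridor environment, the hand-built options, and all parameter values are assumptions made for the example.

```python
import random
from collections import defaultdict

class Corridor:
    """Tiny 1-D corridor: states 0..n-1, start at 0, absorbing goal at n-1."""
    def __init__(self, n=10):
        self.n = n
    def reset(self):
        return 0
    def done(self, s):
        return s == self.n - 1
    def step(self, s, a):
        s2 = min(max(s + a, 0), self.n - 1)
        return s2, (1.0 if self.done(s2) else -0.01)  # small step cost, goal reward

class Option:
    """A temporally extended action: a sub-policy plus a termination state."""
    def __init__(self, subgoal, policy):
        self.subgoal = subgoal   # option terminates when this state is reached
        self.policy = policy     # maps state -> primitive action

def run_option(env, state, option, gamma):
    """Run an option to termination; return one SMDP-level transition."""
    total, discount, steps = 0.0, 1.0, 0
    while True:
        state, reward = env.step(state, option.policy(state))
        total += discount * reward   # accumulate the discounted reward stream
        discount *= gamma
        steps += 1
        if state == option.subgoal or env.done(state):
            return state, total, steps

def smdp_q_learning(env, options, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    """Top level: choose among options, update with an SMDP Q-learning backup."""
    Q = defaultdict(float)   # Q[(state, option_index)]
    for _ in range(episodes):
        s = env.reset()
        while not env.done(s):
            if random.random() < eps:   # epsilon-greedy over options
                o = random.randrange(len(options))
            else:
                o = max(range(len(options)), key=lambda i: Q[(s, i)])
            s2, r, k = run_option(env, s, options[o], gamma)
            # gamma**k: the bootstrap is discounted by the option's duration,
            # so one "action" can span many primitive steps.
            best = max(Q[(s2, i)] for i in range(len(options)))
            Q[(s, o)] += alpha * (r + gamma ** k * best - Q[(s, o)])
            s = s2
    return Q

if __name__ == "__main__":
    env = Corridor(n=10)
    # Two hand-built options: "walk right to the midpoint", "walk right to the goal".
    options = [Option(subgoal=5, policy=lambda s: +1),
               Option(subgoal=9, policy=lambda s: +1)]
    Q = smdp_q_learning(env, options)
    print(Q[(0, 0)], Q[(0, 1)])
```

The key difference from flat Q-learning is the `gamma ** k` term: the top level treats an entire option execution, however many primitive steps it takes, as a single transition, which is one reason temporally extended actions and unknown time delays fit naturally into this formulation.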
