Scalable reinforcement learning through hierarchical decompositions for weakly-coupled problems

Reinforcement learning, or reward-dependent learning, has been very successful at describing how animals and humans adjust their actions to increase gains and reduce losses across a wide variety of tasks. Empirical studies have furthermore identified numerous neuronal correlates of the quantities such computations require. In general, however, it is too expensive for the brain to encode actions and their outcomes with respect to every available dimension describing the state of the world. This suggests the existence of learning algorithms that exploit the independencies present in the world and thereby reduce the computational cost of representation and learning. One possible solution is to use separate learners for task dimensions with independent dynamics and rewards, but the independence condition is usually too restrictive. Here, we propose a hierarchical reinforcement learning solution for the more general case in which the dynamics are not independent but weakly coupled, and we show how to assign credit to the different modules, which solve the task jointly.
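The baseline idea of separate learners for independent task dimensions can be illustrated with a minimal sketch. It assumes tabular SARSA-style modules, each seeing only its own state component and its own reward, with a shared action chosen by summing the modules' value estimates. The names `Module` and `select_action` are hypothetical, and the sketch shows only the fully independent case, not the hierarchical credit-assignment scheme for weakly coupled dynamics proposed here.

```python
import random
from collections import defaultdict


class Module:
    """Tabular learner that sees only its own state component and reward."""

    def __init__(self, n_actions, alpha=0.1, gamma=0.9):
        # One row of action values per module-local state, created on demand.
        self.q = defaultdict(lambda: [0.0] * n_actions)
        self.alpha = alpha  # learning rate (assumed value)
        self.gamma = gamma  # discount factor (assumed value)

    def update(self, s, a, r, s_next, a_next):
        # SARSA-style update toward r + gamma * Q(s', a').
        target = r + self.gamma * self.q[s_next][a_next]
        self.q[s][a] += self.alpha * (target - self.q[s][a])


def select_action(modules, states, n_actions, epsilon, rng):
    """Epsilon-greedy choice over the sum of the modules' Q-values."""
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    totals = [
        sum(m.q[s][a] for m, s in zip(modules, states))
        for a in range(n_actions)
    ]
    return max(range(n_actions), key=totals.__getitem__)


if __name__ == "__main__":
    rng = random.Random(0)
    modules = [Module(n_actions=2), Module(n_actions=2)]
    # Each module is updated with its own state component and reward only.
    modules[0].update(s=0, a=1, r=1.0, s_next=1, a_next=0)
    modules[1].update(s=5, a=0, r=0.5, s_next=6, a_next=0)
    print(select_action(modules, [0, 5], 2, epsilon=0.0, rng=rng))
```

The saving is in representation: with k dimensions of n values each, the modules together store on the order of k·n state rows, whereas a single joint learner would need n^k.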
