Q-Error as a Selection Mechanism in Modular Reinforcement-Learning Systems

This paper introduces a novel multimodular method for reinforcement learning. A multimodular system partitions the learning task among a set of experts (modules), none of which is capable of solving the entire task on its own. Splitting up large tasks in this way has many advantages, but existing methods struggle to decide which module(s) should contribute to the agent's actions at any given moment. We introduce a selection mechanism in which every module, in addition to computing a set of action values, also estimates its own error for the current input. The selection mechanism combines each module's estimate of long-term reward with its self-error estimate to produce a score by which the next module is chosen. This lets the modules use their resources effectively and divide the task efficiently among themselves. The system is shown to learn complex tasks even when the individual modules use only linear function approximators.
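To make the selection idea concrete, below is a minimal sketch in Python (NumPy only) of how such a mechanism might look. The class and function names, the use of the absolute TD error as the self-error target, and the additive score with weight `beta` are illustrative assumptions for this sketch, not the paper's exact formulation.

```python
import numpy as np

class Module:
    """One expert: a linear Q-function plus a linear estimate of its own
    Q-error. Purely illustrative; names and update rules are assumptions."""

    def __init__(self, n_features, n_actions, lr=0.1):
        self.q_weights = np.zeros((n_actions, n_features))  # linear Q(s, a)
        self.err_weights = np.zeros(n_features)              # linear self-error estimate
        self.lr = lr

    def q_values(self, features):
        # One action value per action for the current feature vector.
        return self.q_weights @ features

    def predicted_error(self, features):
        # The module's estimate of its own error on this input.
        return self.err_weights @ features

    def update(self, features, action, td_error):
        # Standard linear Q-learning update for the chosen action.
        self.q_weights[action] += self.lr * td_error * features
        # Train the self-error estimator toward the observed |TD error|
        # (assumed target for this sketch).
        err_target = abs(td_error)
        self.err_weights += self.lr * (err_target - self.predicted_error(features)) * features


def select_module(modules, features, beta=1.0):
    """Score each module by its best action value minus a penalty for its
    predicted self-error, then pick the highest-scoring module. The additive
    form and the trade-off weight `beta` are assumptions of this sketch."""
    scores = [m.q_values(features).max() - beta * m.predicted_error(features)
              for m in modules]
    return int(np.argmax(scores))
```

Under this scoring rule, a module that promises high reward but also predicts a large error on the current input can lose out to a more modest but more reliable module, which is one plausible way the self-error signal could drive the modules to specialize on different regions of the task.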
