Modular on-line function approximation for scaling up reinforcement learning

Reinforcement learning is a powerful learning paradigm for autonomous agents which interact with unknown environments with the objective of maximizing cumulative payoff. Recent research has addressed issues concerning the scaling up of reinforcement learning methods in order to solve problems with large state spaces, composite tasks and tasks involving non-Markovian situations. In this dissertation, I extend existing ways of scaling up reinforcement learning methods and propose several new approaches. An array of Cerebellar Model Articulation Controller (CMAC) networks is used as fast function approximators so that the evaluation function and policy can be learnt on-line as the agent interacts with the environment. Learning systems which combine reinforcement learning techniques with CMAC networks are developed to solve problems with large state and action spaces. Actions can be either discrete or real-valued. The problem of generating a sequence of torque or position change commands in order to drive a simulated multi-linked manipulator towards desired arm configurations is examined. A hierarchical and modular function approximation architecture using CMAC networks is then developed, following the Hierarchical Mixtures of Experts framework. The non-linear function approximation ability of CMAC networks enables non-linear functions to be modelled in expert and gating networks, while permitting fast linear learning rules to be used. An on-line gradient ascent learning procedure derived from the Expectation Maximization algorithm is proposed, enabling faster learning to be achieved. The new architecture can be used to enable reinforcement learning agents to acquire context-dependent evaluation functions and policies. This is demonstrated in an implementation of the Compositional Q-Learning framework in which composite tasks consisting of several elemental tasks are decomposed using reinforcement learning. The framework is extended to the case where rewards can be received in non-terminal states of elemental tasks, and to 'vector of actions' situations where the agent produces several coordinated actions in order to achieve a goal. The resulting system is employed to enable the simulated multi-linked manipulator to position its end-effector at several positions in the workspace sequentially. Finally, the benefits of using prior knowledge in order to extend the capabilities of reinforcement learning agents are examined. A classifier system-based Q-learning scheme is developed to enable agents to reason using condition-action rules. The utility of this scheme is illustrated in a …
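The recurring mechanism in the work summarised above is on-line value-function approximation with CMAC (tile-coding) networks. As a rough illustration of how such a learner fits together, the sketch below pairs a hashed tile coder with the standard Q-learning update, using one coder per discrete action. This is a minimal sketch under assumptions of my own (states normalised to [0, 1]^dim, hashed tile memory, epsilon-greedy action selection); the class and parameter names are illustrative and are not taken from the dissertation.

```python
# Minimal sketch, not the dissertation's implementation: a hashed tile-coding
# (CMAC-style) approximator, one per discrete action, trained on-line with the
# standard Q-learning update. All names and hyperparameters are assumptions.
import numpy as np


class CMAC:
    """Hashed tile coding over a bounded continuous state space in [0, 1]^dim."""

    def __init__(self, n_tilings=8, tiles_per_dim=8, dim=2, memory=4096, seed=0):
        self.n_tilings = n_tilings
        self.tiles_per_dim = tiles_per_dim
        self.memory = memory
        rng = np.random.default_rng(seed)
        # Fixed random offsets shift each tiling relative to the others.
        self.offsets = rng.random((n_tilings, dim)) / tiles_per_dim
        self.w = np.zeros(memory)

    def _active_tiles(self, state):
        """One hashed weight index per tiling for the given state."""
        idx = []
        for t in range(self.n_tilings):
            coords = np.floor((np.asarray(state) + self.offsets[t]) * self.tiles_per_dim).astype(int)
            idx.append(hash((t, *coords.tolist())) % self.memory)
        return idx

    def value(self, state):
        return sum(self.w[i] for i in self._active_tiles(state))

    def update(self, state, target, alpha=0.1):
        # Spread the prediction error equally over the active tiles (LMS rule).
        error = target - self.value(state)
        for i in self._active_tiles(state):
            self.w[i] += (alpha / self.n_tilings) * error


class CMACQLearner:
    """An array of CMACs approximating Q(s, a) for discrete actions, learnt on-line."""

    def __init__(self, n_actions, gamma=0.95, epsilon=0.1, **cmac_kwargs):
        self.q = [CMAC(**cmac_kwargs) for _ in range(n_actions)]
        self.gamma, self.epsilon = gamma, epsilon

    def act(self, state):
        if np.random.random() < self.epsilon:
            return np.random.randint(len(self.q))
        return int(np.argmax([q.value(state) for q in self.q]))

    def learn(self, state, action, reward, next_state, done):
        target = reward
        if not done:
            target += self.gamma * max(q.value(next_state) for q in self.q)
        self.q[action].update(state, target)
```

In use, an agent would call `act` to choose an action at each step and `learn` on the resulting transition, so the evaluation function is refined continuously as experience arrives rather than in a separate off-line pass.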
