Learning to Solve Markovian Decision Processes

This dissertation is about building learning control architectures for agents embedded in finite, stationary, and Markovian environments. Such architectures give embedded agents the ability to autonomously improve the efficiency with which they can achieve goals. Machine learning researchers have developed reinforcement learning (RL) algorithms based on dynamic programming (DP) that use the agent's experience in its environment to incrementally improve its decision policy. This is achieved by adapting an evaluation function in such a way that the decision policy that is "greedy" with respect to it improves with experience. This dissertation focuses on finite, stationary, and Markovian environments for two reasons: such environments permit the development and use of a strong theory of RL, and many challenging real-world RL tasks fall into this category.

This dissertation establishes a novel connection between stochastic approximation theory and RL that provides a uniform framework for understanding all the RL algorithms proposed to date. It also highlights a dimension that clearly separates RL research from prior work on DP. Two further theoretical results, showing how approximations affect performance in RL, provide partial justification for the use of compact function approximators in RL. In addition, a new family of "soft" DP algorithms is presented; these algorithms converge to solutions that are more robust than those found by classical DP algorithms.

Despite this theoretical progress, conventional RL architectures scale poorly, making them impractical for many real-world problems. This dissertation studies two aspects of the scaling issue: the need to accelerate RL, and the need to build RL architectures that can learn to solve multiple tasks. It presents three RL architectures, CQ-L, H-DYNA, and BB-RL, that accelerate learning by facilitating transfer of training from simple to complex tasks. Each architecture achieves transfer in a different way: CQ-L uses the evaluation functions for simple tasks as building blocks to construct the evaluation function for complex tasks; H-DYNA uses the evaluation functions for simple tasks to build an abstract environment model; and BB-RL uses the decision policies found for the simple tasks as the primitive actions for the complex tasks. A mixture of theoretical and empirical results is presented to support the new RL architectures developed in this dissertation.
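As a concrete illustration of the evaluation-function idea described above, the sketch below shows tabular Q-learning on a small Markovian decision process: an action-value function is adapted from experience, and the control policy is simply the one that acts greedily with respect to it. The environment interface (`reset`, `step`, `actions`), step sizes, and task are hypothetical placeholders for illustration, not the specific algorithms or tasks studied in the dissertation.

```python
import random
from collections import defaultdict

def q_learning(env, n_episodes=500, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning: adapt an evaluation (action-value) function so that
    the policy that is greedy with respect to it improves with experience.

    `env` is assumed to expose reset() -> state, step(a) -> (next_state, reward, done),
    and a list of discrete actions `env.actions`; this interface is illustrative.
    """
    Q = defaultdict(float)  # evaluation function: (state, action) -> estimated return

    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy exploration around the current greedy policy
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda a_: Q[(s, a_)])
            s2, r, done = env.step(a)
            # one-step DP-style backup using sampled experience
            target = r + (0.0 if done else gamma * max(Q[(s2, a_)] for a_ in env.actions))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2

    # the decision policy is greedy with respect to the learned evaluation function
    def policy(state):
        return max(env.actions, key=lambda a_: Q[(state, a_)])

    return Q, policy
```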

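The "soft" DP algorithms mentioned above replace the hard maximization of classical DP with a smoother backup. The sketch below is one minimal, illustrative version of that idea for value iteration, using a temperature-controlled Boltzmann-weighted backup in place of the max; this is an assumed, generic formulation for illustration and not necessarily the exact family of algorithms developed in the dissertation.

```python
import math

def soft_value_iteration(states, actions, P, R, gamma=0.95, temperature=1.0,
                         n_iters=1000, tol=1e-6):
    """Value iteration with a 'soft' backup: the hard max over actions is
    replaced by a Boltzmann-weighted average of the action values.  As the
    temperature approaches zero this recovers the classical hard-max backup.

    P[s][a] is assumed to be a list of (next_state, probability) pairs and
    R[s][a] an expected immediate reward; both are illustrative placeholders.
    """
    V = {s: 0.0 for s in states}
    for _ in range(n_iters):
        delta = 0.0
        for s in states:
            # one-step lookahead action values under the current V
            q = [R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a]) for a in actions]
            # Boltzmann-weighted backup instead of max(q)
            m = max(q)  # subtract the max for numerical stability of the exponentials
            w = [math.exp((qi - m) / temperature) for qi in q]
            z = sum(w)
            v_new = sum(wi * qi for wi, qi in zip(w, q)) / z
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            break
    return V
```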