Integration of Partially Observable Markov Decision Processes and Reinforcement Learning for Simulated Robot Navigation

This dissertation presents a two-level architecture for goal-directed robot control. The low-level actions are learned online as the robot performs its tasks, reducing the need for the system designer to program for every possible contingency. The actions adapt to failures in sensors and effectors, allowing the robot to perform its assigned tasks despite hardware failure. Reactivity, deliberation, and learning are an integral part of the architecture, which uses a partially observable Markov decision process (POMDP) model for planning and reinforcement learning (RL) for the low-level actions. In addition to the robot architecture, this dissertation presents and evaluates a new parallel POMDP solution algorithm and a new algorithm that uses decision trees for function approximation in RL.

New low-level actions may be instantiated with no knowledge of the state transition they are supposed to accomplish; the patterns of reward and punishment cause each one to learn its assigned state transition. In the event of sensor or effector failure, the low-level actions adapt so as to maximize reward even with reduced sensor information or effector availability.

Experiments are conducted in a simulated maze-like environment to compare different versions of the architecture. In the first experiment, hand-coded actions are used. The remaining experiments compare the performance of the system using hand-coded actions to its performance using learned actions. A final experiment demonstrates that the system can learn a new action that was not pre-specified by the system designer. The experiments demonstrate that the combination of POMDP planning and reinforcement learning yields a highly reactive system that can also achieve long-term goals, adapt to failures, and learn new low-level actions. Demonstrating the robot control architecture required improving or modifying existing approaches to reinforcement learning and POMDP planning. The approach to learning low-level actions differs from previous approaches, and the experimental results indicate that it performs well in the simulated maze-like environment.
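
To make the two-level design concrete, below is a minimal Python sketch of the loop the abstract describes: a POMDP layer tracks a belief over discrete states and selects an abstract action, and each abstract action is a low-level RL module (here a tabular Q-learner) trained online from reward and punishment. The class names, the uniform initial belief, and the placeholder belief policy are illustrative assumptions, not the dissertation's actual implementation.

```python
import random
from collections import defaultdict

class QLearningAction:
    """One low-level action module, trained online from reward and punishment.
    (Hypothetical sketch; the dissertation uses function approximation rather
    than a lookup table.)"""

    def __init__(self, primitives, alpha=0.1, gamma=0.95, epsilon=0.1):
        self.primitives = primitives          # primitive motor commands
        self.q = defaultdict(float)           # Q[(observation, primitive)]
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def choose(self, obs):
        # Epsilon-greedy selection over primitive commands.
        if random.random() < self.epsilon:
            return random.choice(self.primitives)
        return max(self.primitives, key=lambda p: self.q[(obs, p)])

    def update(self, obs, primitive, reward, next_obs):
        # Standard one-step Q-learning update.
        best_next = max(self.q[(next_obs, p)] for p in self.primitives)
        target = reward + self.gamma * best_next
        self.q[(obs, primitive)] += self.alpha * (target - self.q[(obs, primitive)])

class POMDPPlanner:
    """High-level layer: Bayesian belief tracking plus a belief-state policy."""

    def __init__(self, states, transition, observation_model):
        self.belief = {s: 1.0 / len(states) for s in states}
        self.transition = transition                # T[s][a] -> {s': prob}
        self.observation_model = observation_model  # O[s'][a] -> {obs: prob}

    def update_belief(self, action, obs):
        # Predict with the transition model, weight by observation likelihood,
        # then renormalize.
        new_belief = {}
        for s_next in self.belief:
            p = sum(self.belief[s] * self.transition[s][action].get(s_next, 0.0)
                    for s in self.belief)
            new_belief[s_next] = p * self.observation_model[s_next][action].get(obs, 0.0)
        total = sum(new_belief.values()) or 1.0
        self.belief = {s: p / total for s, p in new_belief.items()}

    def select_action(self, actions):
        # Placeholder: a real system would evaluate a policy computed by a
        # POMDP solution algorithm over belief space; here we pick at random.
        return random.choice(actions)
```

The sketch is only meant to show how the two layers interact: the planner picks which low-level module to run, the module issues primitive commands and learns from the resulting reward, and the planner folds the resulting observation back into its belief state.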
