Hierarchical control and learning for Markov decision processes

This dissertation investigates the use of hierarchy and problem decomposition as a means of solving large, stochastic, sequential decision problems, framed as Markov decision problems (MDPs). The new technical content begins with a discussion of temporal abstraction, which is shown to be equivalent to transforming a policy defined over a region of an MDP into an action in a semi-Markov decision problem (SMDP). Several algorithms are presented for performing this transformation efficiently. The dissertation then introduces the HAM (Hierarchies of Abstract Machines) method for generating hierarchical, temporally abstract actions. This method permits the partial specification of abstract actions in a way that corresponds to an abstract plan or strategy. Abstract actions specified as HAMs can be optimally refined for new tasks by solving a reduced SMDP. The formal results show that traditional MDP algorithms can be used to optimally refine HAMs for new tasks, and that this can be done in much less time than it would take to learn a new policy for the task from scratch. HAMs complement the novel decomposition algorithms presented in this dissertation, which construct a cache of policies for different regions of the MDP and then optimally combine the cached solutions to produce a global solution that is within provable bounds of the optimal solution. Together, the methods developed in this dissertation provide important tools for producing good policies for large MDPs. Unlike some ad hoc methods, they provide strong formal guarantees: they use prior knowledge in a principled way, and they reduce large MDPs to smaller ones while maintaining a well-defined relationship between the smaller problem and the larger problem.
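
As a concrete illustration of the first idea, the sketch below shows one standard way a fixed policy on a region of an MDP induces a single temporally abstract (SMDP) action: solving one linear system yields the expected discounted reward accumulated before leaving the region and the discounted transition model over exit states. This is an illustrative sketch, not code from the dissertation; the names `P`, `R`, `region`, `exits`, and `gamma` are assumptions introduced here.

```python
# Illustrative sketch (assumed setup, not the dissertation's implementation):
# executing a fixed policy inside a region of an MDP until an exit state is
# reached behaves like one abstract action in an SMDP.
import numpy as np

def policy_to_smdp_action(P, R, region, exits, gamma):
    """Induced SMDP model of running a fixed policy in `region` until exit.

    P      : (n, n) array, P[s, s'] = transition probability under the policy
    R      : (n,) array, expected one-step reward under the policy
    region : list of interior state indices (policy executes here)
    exits  : list of exit state indices (execution stops on entry)
    gamma  : discount factor in (0, 1)

    Returns
    -------
    r : (|region|,) expected discounted reward accumulated before exiting
    F : (|region|, |exits|) discounted transition model, i.e. the expectation
        of gamma**tau for leaving the region through each exit state.
    """
    P_rr = P[np.ix_(region, region)]   # interior-to-interior transitions
    P_re = P[np.ix_(region, exits)]    # interior-to-exit transitions
    A = np.eye(len(region)) - gamma * P_rr
    r = np.linalg.solve(A, R[region])        # solves r = R + gamma * P_rr @ r
    F = np.linalg.solve(A, gamma * P_re)     # solves F = gamma * (P_re + P_rr @ F)
    return r, F
```

A planner that treats each region's `(r, F)` pair as the reward and transition model of one abstract action is then solving an SMDP over exit states, which is the sense in which regional policies can be cached and recombined.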
