Hierarchical reinforcement learning in continuous state and multi-agent environments

This dissertation investigates the use of hierarchy and abstraction as a means of solving complex sequential decision-making problems, such as those with continuous state and/or action spaces and domains with multiple cooperative agents. It develops several novel extensions to hierarchical reinforcement learning (HRL) and designs algorithms suited to such problems. The average reward optimality criterion has been shown to be more natural than the more commonly used discounted criterion for continuing tasks. This thesis investigates two formulations of HRL based on the average reward semi-Markov decision process (SMDP) model, in both discrete and continuous time. These formulations correspond to two notions of optimality explored in previous work on HRL: hierarchical optimality and recursive optimality. Novel discrete-time and continuous-time algorithms, termed hierarchically optimal average reward RL (HAR) and recursively optimal average reward RL (RAR), are presented; they learn hierarchically and recursively optimal average reward policies, respectively. Two automated guided vehicle (AGV) scheduling problems serve as experimental testbeds for an empirical study of the proposed algorithms. Policy gradient reinforcement learning (PGRL) methods have several advantages over the more traditional value-function RL algorithms for problems with continuous state spaces, but they suffer from slow convergence. This thesis defines a family of hierarchical policy gradient RL (HPGRL) algorithms for scaling PGRL methods to high-dimensional domains. It also examines the use of HRL to accelerate policy learning in cooperative multi-agent tasks. Hierarchy speeds up learning in multi-agent domains by making it possible to learn coordination skills at the level of subtasks rather than primitive actions; coordinating at the subtask level improves cooperation because agents are not distracted by low-level details. A framework for hierarchical multi-agent RL is developed, and an algorithm called Cooperative HRL is presented that solves cooperative multi-agent problems more efficiently by learning coordination at the subtask level.
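
For reference, the two optimality criteria contrasted above can be written in their standard forms (the notation below is the usual MDP/SMDP notation and is not drawn from the dissertation itself). The discounted criterion weights future rewards geometrically, whereas the average reward (gain) criterion evaluates a policy by its long-run reward per unit of time, which is the objective argued to be more natural for continuing tasks:

    % Discounted return of policy \pi from state s
    J_\gamma^\pi(s) \;=\; \mathbb{E}^\pi\!\Big[\textstyle\sum_{t=0}^{\infty} \gamma^{\,t}\, r_{t+1} \,\Big|\, s_0 = s\Big],
    \qquad 0 \le \gamma < 1

    % Average reward (gain) of policy \pi in the discrete-time case
    g^\pi(s) \;=\; \lim_{N \to \infty} \frac{1}{N}\,
    \mathbb{E}^\pi\!\Big[\textstyle\sum_{t=0}^{N-1} r_{t+1} \,\Big|\, s_0 = s\Big]

    % SMDP (continuous-time) gain: expected reward per expected sojourn time,
    % where \tau_{i+1} is the duration of the i-th decision epoch
    g^\pi(s) \;=\; \lim_{N \to \infty}
    \frac{\mathbb{E}^\pi\big[\sum_{i=0}^{N-1} r_{i+1} \,\big|\, s_0 = s\big]}
         {\mathbb{E}^\pi\big[\sum_{i=0}^{N-1} \tau_{i+1} \,\big|\, s_0 = s\big]}

The SMDP form of the gain is what underlies the average reward HRL formulations mentioned above, since subtasks take variable amounts of time to complete.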

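The claim that coordination is easier to learn over subtasks than over primitive actions can be illustrated with a minimal sketch. The snippet below is a hypothetical, simplified illustration and is not the dissertation's Cooperative HRL algorithm: a tabular SMDP-style Q-learner whose actions are high-level subtasks and whose coordination state is the tuple of subtasks the other agents are currently executing, so joint values are learned over a small set of subtasks rather than over primitive actions. All names (CoordinatingAgent, the subtask labels, the parameters) are invented for this example.

    import random
    from collections import defaultdict

    class CoordinatingAgent:
        """Tabular SMDP Q-learner over subtasks; the coordination state is the
        tuple of subtasks the other agents are currently executing."""

        def __init__(self, subtasks, alpha=0.1, gamma=0.95, epsilon=0.1):
            self.subtasks = subtasks        # high-level actions, e.g. ("deliver_M1", "deliver_M2", "idle")
            self.q = defaultdict(float)     # Q[(local_state, others_subtasks, subtask)]
            self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

        def choose(self, local_state, others_subtasks):
            """Epsilon-greedy choice over subtasks, conditioned on what the
            other agents are doing (others_subtasks must be a hashable tuple)."""
            if random.random() < self.epsilon:
                return random.choice(self.subtasks)
            return max(self.subtasks,
                       key=lambda u: self.q[(local_state, others_subtasks, u)])

        def update(self, local_state, others_subtasks, subtask,
                   reward, duration, next_state, next_others):
            """SMDP-style update: the chosen subtask ran for `duration` primitive
            steps and accumulated `reward` before control returned to this level."""
            best_next = max(self.q[(next_state, next_others, u)] for u in self.subtasks)
            target = reward + (self.gamma ** duration) * best_next
            key = (local_state, others_subtasks, subtask)
            self.q[key] += self.alpha * (target - self.q[key])

Because the coordination state ranges over the other agents' subtasks rather than their primitive actions, the joint value table stays small; this is the intuition behind the speed-up from subtask-level coordination described in the abstract.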