Learning state and action space hierarchies for reinforcement learning using action-dependent partitioning

Autonomous systems are often difficult to program. Reinforcement learning (RL) is an attractive alternative, as it allows the agent to learn behavior on the basis of sparse, delayed reward signals provided only when the agent reaches desired goals. Recent attempts to address the dimensionality problem in RL have turned to principled ways of exploiting temporal abstraction, in which decisions are not required at every step but instead invoke temporally extended activities that follow their own policies until termination. This leads naturally to hierarchical control architectures and associated learning algorithms. This dissertation reviews several recently developed approaches to temporal abstraction and hierarchical organization and presents a new method for the autonomous construction of hierarchical action and state representations in reinforcement learning, aimed at accelerating learning and extending the scope of such systems.

In this approach, the agent uses information acquired while learning one task to discover subgoals for similar tasks. It transfers this knowledge to subsequent tasks and accelerates learning by creating useful new subgoals and by learning the corresponding subtask policies off-line as abstract actions (options). At the same time, the subgoal actions are used to construct a more abstract state representation through action-dependent state space partitioning. This representation forms a new level in the state space hierarchy and serves as the initial representation, the decision layer, for new learning tasks. To ensure that tasks remain learnable, value functions are built simultaneously at different levels of the hierarchy, and inconsistencies between them are used to identify the actions with which to refine the relevant portions of the abstract state space. The refined representation then serves as the first layer of the hierarchy. To capture the structure of the state space for future tasks, the decision layer is constructed from an estimate of the expected time to learn a new task and the system's experience with previously learned tasks.

Together, these techniques allow the agent to form progressively more abstract action and state representations over time. Experiments in deterministic and stochastic domains show that the presented method can significantly outperform learning on a flat state space representation.
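
To make the two central ideas concrete, the following Python sketch first learns a tabular Q-function for one task and then groups states by the greedy (sub)action chosen in each, the simplest reading of an action-dependent partition. It is illustrative only, not the dissertation's algorithm: the toy gridworld, the primitive action set (which could equally be subgoal options), and all hyper-parameters and helper names are assumptions.

# Minimal sketch: Q-learning on one task, then an action-dependent partition
# of the state space. Environment and parameters are illustrative assumptions.

import random
from collections import defaultdict

ACTIONS = ["up", "down", "left", "right"]    # could equally be subgoal options
SIZE, GOAL = 5, (4, 4)                       # toy 5x5 deterministic gridworld


def step(state, action):
    """Deterministic gridworld transition with a sparse, delayed reward."""
    x, y = state
    dx, dy = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}[action]
    nxt = (min(max(x + dx, 0), SIZE - 1), min(max(y + dy, 0), SIZE - 1))
    return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL


def q_learning(episodes=2000, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Standard epsilon-greedy Q-learning; returns the learned Q-table."""
    q = defaultdict(float)                   # (state, action) -> value
    for _ in range(episodes):
        state, done = (0, 0), False
        while not done:
            if random.random() < epsilon:
                action = random.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])
            nxt, reward, done = step(state, action)
            target = reward + gamma * max(q[(nxt, a)] for a in ACTIONS)
            q[(state, action)] += alpha * (target - q[(state, action)])
            state = nxt
    return q


def action_dependent_partition(q):
    """Group states by their greedy action, one abstract state per block.

    States that call for the same (sub)action are merged into a single
    abstract state, which a later task can use as a coarse initial
    representation and refine where value-function inconsistencies appear.
    """
    blocks = defaultdict(set)
    for x in range(SIZE):
        for y in range(SIZE):
            s = (x, y)
            greedy = max(ACTIONS, key=lambda a: q[(s, a)])
            blocks[greedy].add(s)
    return dict(blocks)


if __name__ == "__main__":
    q_table = q_learning()
    for action, states in action_dependent_partition(q_table).items():
        print(f"abstract state '{action}': {len(states)} primitive states")

In this toy setting the partition collapses the 25 primitive states into at most four abstract states, one per greedy action; the dissertation's method refines such blocks further wherever value functions built at different levels of the hierarchy disagree.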
