Paper information - Hierarchically Optimal Average Reward Reinforcement Learning

Two notions of optimality have been explored in previous work on hierarchical reinforcement learning (HRL): hierarchical optimality, meaning the optimal policy in the space defined by a task hierarchy, and a weaker local model called recursive optimality. In this paper, we introduce two new average-reward HRL algorithms for finding hierarchically optimal policies. We compare them to our previously reported algorithms for computing recursively optimal policies, using a grid-world taxi problem and a more realistic automated guided vehicle (AGV) scheduling problem. The new algorithms are based on a three-part value function decomposition proposed recently by Andre and Russell, which generalizes Dietterich’s MAXQ value function decomposition. A key difference between the algorithms proposed in this paper and our previous work is that there is only a single global gain (average reward), instead of a gain for each subtask. Our results show that the new average-reward algorithms perform better than both the previous recursively optimal counterparts and the corresponding discounted hierarchically optimal algorithms.
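To make the "single global gain" idea concrete, here is a minimal sketch, not the authors' algorithm. It assumes the gain is estimated as cumulative reward divided by cumulative elapsed time, and that each value update subtracts gain times duration from the reward, as in standard average-reward (semi-)MDP learning. The names `GlobalGain` and `gain_adjusted_target` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class GlobalGain:
    """One hierarchy-wide estimate of average reward per unit time (assumed form)."""
    total_reward: float = 0.0
    total_time: float = 0.0

    @property
    def value(self) -> float:
        # Ratio of accumulated reward to accumulated time; 0.0 before any data.
        return self.total_reward / self.total_time if self.total_time > 0 else 0.0

    def update(self, reward: float, duration: float = 1.0) -> float:
        # Every subtask feeds the same accumulator, so there is only one gain.
        self.total_reward += reward
        self.total_time += duration
        return self.value


def gain_adjusted_target(reward: float, duration: float,
                         gain: float, next_value: float) -> float:
    """Average-adjusted one-step target: r - gain * duration + V(s')."""
    return reward - gain * duration + next_value


gain = GlobalGain()
gain.update(2.0)                 # primitive step: reward 2 over 1 time unit
gain.update(4.0, duration=2.0)   # temporally extended step: reward 4 over 2 units
target = gain_adjusted_target(4.0, 2.0, gain.value, next_value=1.0)
```

In a per-subtask scheme each subtask would hold its own `GlobalGain`-like accumulator; sharing one instance across the whole hierarchy is the design choice the abstract highlights.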

Sridhar Mahadevan | Mohammad Ghavamzadeh

[1] Gang Wang, et al. Hierarchical Optimization of Policy-Coupled Semi-Markov Decision Processes, 1999, ICML.

[2] Prasad Tadepalli, et al. Auto-Exploratory Average Reward Reinforcement Learning, 1996, AAAI/IAAI, Vol. 1.

[3] Vivek S. Borkar, et al. Learning Algorithms for Markov Decision Processes with Average Cost, 2001, SIAM J. Control. Optim.

[4] Michael O. Duff, et al. Reinforcement Learning Methods for Continuous-Time Markov Decision Problems, 1994, NIPS.

[5] David Andre, et al. State Abstraction for Programmable Reinforcement Learning Agents, 2002, AAAI/IAAI.

[6] Thomas G. Dietterich. Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition, 1999, J. Artif. Intell. Res.

[7] Sridhar Mahadevan, et al. Continuous-Time Hierarchical Reinforcement Learning, 2001, ICML.

[8] Doina Precup, et al. Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning, 1999, Artif. Intell.

[9] Ronald E. Parr, et al. Hierarchical Control and Learning for Markov Decision Processes, 1998.

[10] Anton Schwartz, et al. A Reinforcement Learning Method for Maximizing Undiscounted Rewards, 1993, ICML.

[11] Martin L. Puterman, et al. Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1994.

[12] Sridhar Mahadevan, et al. Average Reward Reinforcement Learning: Foundations, Algorithms, and Empirical Results, 2004, Machine Learning.