Hierarchical Optimization of Policy-Coupled Semi-Markov Decision Processes

One general strategy for approximately solving large Markov decision processes is "divide-and-conquer": the original problem is decomposed into sub-problems which interact with each other, but can nonetheless be solved independently by taking the nature of the interaction into account. In this paper we focus on a class of "policy-coupled" semi-Markov decision processes (SMDPs), which arise in many nonstationary real-world multi-agent tasks, such as manufacturing and robotics. The nature of the interaction among sub-problems (agents) is more subtle than that studied previously: the components of a sub-SMDP, namely the available states and actions, transition probabilities, and rewards, depend on the policies used in solving the "neighboring" sub-SMDPs. This "strongly-coupled" interaction among sub-problems causes the approach of solving each sub-SMDP in parallel to fail. We present a novel approach whereby many variants of each sub-SMDP are solved, explicitly taking into account the different modes of interaction, and a dynamic merging algorithm is used to combine the base-level policies. We present detailed experimental results for a 12-machine transfer line, a large real-world manufacturing task. We show that the hierarchical approach is not only much faster than a "flat" algorithm, but also outperforms two well-known heuristics for running transfer lines used in many factories.
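To make the two-level scheme above concrete, the following is a minimal sketch (not the paper's implementation) of the idea: offline, solve one variant of each sub-SMDP per assumed mode of interaction with its neighbors; online, merge the base-level policies by letting each sub-problem act under the variant that matches the mode its neighbors' current policies induce. All names (`solve_sub_smdp`, `solve_all_variants`, `merged_action`, `observe_mode`), the data layout, and the tabular SMDP value-iteration solver are illustrative assumptions.

```python
from typing import Callable, Dict, Hashable, List, Tuple

# Assumed representation: one sub-SMDP variant is a dict mapping
# (state, action) to a list of outcomes (prob, next_state, reward, sojourn_time).
Outcome = Tuple[float, Hashable, float, float]
SubSMDP = Dict[Tuple[Hashable, Hashable], List[Outcome]]


def solve_sub_smdp(sub_smdp: SubSMDP, gamma: float = 0.95, tol: float = 1e-6):
    """Tabular value iteration for one sub-SMDP variant.

    Sojourn times enter through the SMDP discount factor gamma**tau.
    Returns a greedy deterministic policy: state -> action.
    """
    states = {s for (s, _a) in sub_smdp}
    values = {s: 0.0 for s in states}

    def q(outcomes):
        # Expected discounted return of one (state, action) pair.
        return sum(p * (r + gamma ** tau * values.get(s2, 0.0))
                   for p, s2, r, tau in outcomes)

    while True:
        delta = 0.0
        for s in states:
            best = max(q(outcomes)
                       for (s0, _a), outcomes in sub_smdp.items() if s0 == s)
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            break

    return {
        s: max(((a, q(outcomes))
                for (s0, a), outcomes in sub_smdp.items() if s0 == s),
               key=lambda pair: pair[1])[0]
        for s in states
    }


def solve_all_variants(build_variant: Callable[[int, Hashable], SubSMDP],
                       sub_problems: List[int],
                       modes: Dict[int, List[Hashable]]):
    """Offline phase: solve one policy per (sub-problem, interaction mode) pair."""
    return {(i, m): solve_sub_smdp(build_variant(i, m))
            for i in sub_problems for m in modes[i]}


def merged_action(variant_policies, observe_mode, global_state):
    """Online phase (dynamic merging): each sub-problem acts under the variant
    whose assumed mode matches what its neighbors' policies currently induce."""
    joint = {}
    for (i, mode), policy in variant_policies.items():
        if mode == observe_mode(i, global_state):
            joint[i] = policy.get(global_state[i])
    return joint
```

In a transfer-line setting, each `i` would correspond to one machine-and-buffer sub-problem, and `observe_mode` would summarize how the neighboring machines' current policies affect its arrival and blocking behavior; the merged action dictionary then gives one local decision per machine at each decision epoch.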
