Hierarchical Optimization of Policy-Coupled Semi-Markov Decision Processes

One general strategy for approximately solving large Markov decision processes is "divide-and-conquer": the original problem is decomposed into sub-problems which interact with each other, but can nonetheless be solved independently by taking the nature of the interaction into account. In this paper we focus on a class of "policy-coupled" semi-Markov decision processes (SMDPs), which arise in many nonstationary real-world multi-agent tasks, such as manufacturing and robotics. The nature of the interaction among sub-problems (agents) is more subtle than that studied previously: the components of a sub-SMDP, namely the available states and actions, transition probabilities, and rewards, depend on the policies used in solving the "neighboring" sub-SMDPs. This "strongly-coupled" interaction among sub-problems causes the approach of solving each sub-SMDP in parallel to fail. We present a novel approach whereby many variants of each sub-SMDP are solved, explicitly taking into account the different modes of interaction, and a dynamic merging algorithm is used to combine the base-level policies. We present detailed experimental results for a 12-machine transfer line, a large real-world manufacturing task. We show that the hierarchical approach is not only much faster than a "flat" algorithm, but also outperforms two well-known heuristics for running transfer lines used in many factories.
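To make the two-level scheme above concrete, the following is a minimal sketch (not the paper's implementation) of the idea: offline, solve one variant of each sub-SMDP per assumed mode of interaction with its neighbors; online, merge the base-level policies by letting each sub-problem act under the variant that matches the mode its neighbors' current policies induce. All names (`solve_sub_smdp`, `solve_all_variants`, `merged_action`, `observe_mode`), the data layout, and the tabular SMDP value-iteration solver are illustrative assumptions.

```python
from typing import Callable, Dict, Hashable, List, Tuple

# Assumed representation: one sub-SMDP variant is a dict mapping
# (state, action) to a list of outcomes (prob, next_state, reward, sojourn_time).
Outcome = Tuple[float, Hashable, float, float]
SubSMDP = Dict[Tuple[Hashable, Hashable], List[Outcome]]


def solve_sub_smdp(sub_smdp: SubSMDP, gamma: float = 0.95, tol: float = 1e-6):
    """Tabular value iteration for one sub-SMDP variant.

    Sojourn times enter through the SMDP discount factor gamma**tau.
    Returns a greedy deterministic policy: state -> action.
    """
    states = {s for (s, _a) in sub_smdp}
    values = {s: 0.0 for s in states}

    def q(outcomes):
        # Expected discounted return of one (state, action) pair.
        return sum(p * (r + gamma ** tau * values.get(s2, 0.0))
                   for p, s2, r, tau in outcomes)

    while True:
        delta = 0.0
        for s in states:
            best = max(q(outcomes)
                       for (s0, _a), outcomes in sub_smdp.items() if s0 == s)
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            break

    return {
        s: max(((a, q(outcomes))
                for (s0, a), outcomes in sub_smdp.items() if s0 == s),
               key=lambda pair: pair[1])[0]
        for s in states
    }


def solve_all_variants(build_variant: Callable[[int, Hashable], SubSMDP],
                       sub_problems: List[int],
                       modes: Dict[int, List[Hashable]]):
    """Offline phase: solve one policy per (sub-problem, interaction mode) pair."""
    return {(i, m): solve_sub_smdp(build_variant(i, m))
            for i in sub_problems for m in modes[i]}


def merged_action(variant_policies, observe_mode, global_state):
    """Online phase (dynamic merging): each sub-problem acts under the variant
    whose assumed mode matches what its neighbors' policies currently induce."""
    joint = {}
    for (i, mode), policy in variant_policies.items():
        if mode == observe_mode(i, global_state):
            joint[i] = policy.get(global_state[i])
    return joint
```

In a transfer-line setting, each `i` would correspond to one machine-and-buffer sub-problem, and `observe_mode` would summarize how the neighboring machines' current policies affect its arrival and blocking behavior; the merged action dictionary then gives one local decision per machine at each decision epoch.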
