Paper information - An improved algorithm for solving communicating average reward Markov decision processes

An improved algorithm for solving communicating average reward Markov decision processes

This paper provides a policy iteration algorithm for solving communicating Markov decision processes (MDPs) under the average reward criterion. The algorithm is based on the result that every communicating MDP has an optimal policy that is unichain. The improvement step is modified to select only unichain policies; consequently, the nested optimality equations of Howard's multichain policy iteration algorithm are avoided. Properties and advantages of the algorithm are discussed, and it is incorporated into a decomposition algorithm for solving multichain MDPs. Since it is easier to show that a problem is communicating than to show that it is unichain, we recommend using this algorithm instead of unichain policy iteration.
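The abstract's remark that the communicating property is easy to verify can be illustrated with a small sketch. It is not taken from the paper; it relies on the standard characterization that a finite MDP is communicating exactly when the directed graph with an edge from s to t whenever some action gives positive probability of moving from s to t is strongly connected. The function names and the nested-list layout `P[s][a][t]` are assumptions made for this illustration.

```python
from collections import deque


def _reachable(adj, start):
    """Return the set of nodes reachable from `start` by breadth-first search."""
    seen = {start}
    queue = deque([start])
    while queue:
        s = queue.popleft()
        for t in adj[s]:
            if t not in seen:
                seen.add(t)
                queue.append(t)
    return seen


def is_communicating(P):
    """Check whether a finite MDP is communicating.

    P[s][a][t] is the probability of moving from state s to state t under
    the a-th action available in s (hypothetical layout for this sketch).
    """
    n = len(P)
    if n == 0:
        return True
    forward = [set() for _ in range(n)]
    backward = [set() for _ in range(n)]
    for s, actions in enumerate(P):
        for row in actions:
            for t, p in enumerate(row):
                if p > 0:
                    forward[s].add(t)   # some action can move s -> t
                    backward[t].add(s)  # same edge in the reversed graph
    # Strongly connected iff state 0 reaches every state in the graph
    # and in its reverse.
    return (len(_reachable(forward, 0)) == n
            and len(_reachable(backward, 0)) == n)


# Two states, each with a "stay" and a "switch" action. The policy that
# always stays has two recurrent classes, so the MDP is not unichain,
# yet it is communicating.
P_stay_or_switch = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]

# State 1 is absorbing under its only action, so state 0 cannot be
# reached from it and the MDP is not communicating.
P_absorbing = [
    [[0.0, 1.0]],
    [[0.0, 1.0]],
]

print(is_communicating(P_stay_or_switch))
print(is_communicating(P_absorbing))
```

The check needs only two graph searches over the union of all actions' transitions, whereas verifying the unichain property directly would mean examining the chain structure of every stationary deterministic policy.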

M. Puterman | M. Haviv

[1] Ronald A. Howard, et al. Dynamic Programming and Markov Processes, 1960.

[2] D. Blackwell. Discrete Dynamic Programming, 1962.

[3] C. Derman. Denumerable State Markovian Decision Processes: Average Cost Criterion, 1966.

[4] Bennett L. Fox, et al. Scientific Applications: An algorithm for identifying the ergodic subchains and transient states of a stochastic matrix, 1967, Commun. ACM.

[5] J. Bather. Optimal decision procedures for finite Markov chains. Part II: Communicating systems, 1973, Advances in Applied Probability.

[6] Martin L. Puterman, et al. On the Convergence of Policy Iteration in Finite State Undiscounted Markov Decision Processes: The Unichain Case, 1987, Math. Oper. Res.

[7] Katsuhisa Ohno, et al. Computing Optimal Policies for Controlled Tandem Queueing Systems, 1987, Oper. Res.

[8] J. Filar, et al. Communicating MDPs: Equivalence and LP properties, 1988.

[9] Peter W. Jones, et al. Stochastic Modelling and Analysis, 1988.

[10] Keith W. Ross, et al. Multichain Markov Decision Processes with a Sample Path Constraint: A Decomposition Approach, 1991, Math. Oper. Res.