The policy iteration algorithm for average reward Markov decision processes with general state space

The average cost optimal control problem is addressed for Markov decision processes with unbounded cost. It is found that the policy iteration algorithm generates a sequence of policies which are c-regular, where c is the cost function under consideration. This result requires only the existence of an initial c-regular policy and an irreducibility condition on the state space. Furthermore, under these conditions the sequence of relative value functions generated by the algorithm is bounded from below and "nearly" decreasing, from which it follows that the algorithm is always convergent. Under further conditions, it is shown that the algorithm computes a solution to the optimality equations and hence an optimal average cost policy. These results provide elementary criteria for the existence of optimal policies for Markov decision processes with unbounded cost, and they recover known results for the standard linear-quadratic-Gaussian problem. In the control of multiclass queueing networks in particular, it is found that there is a close connection between optimization of the network and optimal control of a far simpler fluid network model.
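
To make the two steps of policy iteration concrete, the following is a minimal sketch for a finite-state, finite-action, unichain model. This is a much simpler setting than the general state space and unbounded cost treated in the paper, and it is not the paper's construction. Each pass evaluates the current policy by solving Poisson's equation h + eta = c_pi + P_pi h for the average cost eta and the relative value function h, then improves the policy by minimizing c(x, a) + sum_y P_a(x, y) h(y) over actions. The function names, the normalization h[0] = 0, the array layout, and the randomly generated model are all assumptions made for illustration.

```python
import numpy as np


def evaluate(P_pi, c_pi):
    """Solve Poisson's equation h + eta = c_pi + P_pi h, normalized by h[0] = 0.

    Assumes the chain under the policy is unichain, so the system is nonsingular.
    """
    n = len(c_pi)
    A = np.eye(n) - P_pi
    # h[0] is pinned to 0, so its column is reused for the unknown eta,
    # whose coefficient is 1 in every equation.
    A[:, 0] = 1.0
    x = np.linalg.solve(A, c_pi)
    eta = x[0]
    h = x.copy()
    h[0] = 0.0
    return eta, h


def policy_iteration(P, c, policy, max_iter=100, tol=1e-10):
    """Average-cost policy iteration (illustrative finite-state sketch).

    P[a] is the n x n transition matrix under action a; c[x, a] is the one-step cost.
    """
    n, _ = c.shape
    idx = np.arange(n)
    policy = np.asarray(policy).copy()
    for _ in range(max_iter):
        # Policy evaluation: average cost and relative value function.
        eta, h = evaluate(P[policy, idx, :], c[idx, policy])
        # Policy improvement: q[x, a] = c(x, a) + sum_y P_a(x, y) h(y).
        q = c + np.einsum("axy,y->xa", P, h)
        best = q.argmin(axis=1)
        # Change an action only where it is strictly better, to avoid cycling on ties.
        improve = q[idx, best] < q[idx, policy] - tol
        if not improve.any():
            break
        policy[improve] = best[improve]
    # Re-evaluate so the returned eta and h match the returned policy.
    eta, h = evaluate(P[policy, idx, :], c[idx, policy])
    return policy, eta, h


# Hypothetical model: strictly positive transition rows make every policy unichain.
rng = np.random.default_rng(0)
n, m = 3, 2
P = rng.dirichlet(np.ones(n), size=(m, n))  # shape (m, n, n), rows sum to 1
c = rng.uniform(0.0, 10.0, size=(n, m))
policy0 = np.zeros(n, dtype=int)

eta0, _ = evaluate(P[policy0, np.arange(n), :], c[np.arange(n), policy0])
policy, eta, h = policy_iteration(P, c, policy0)
print("initial average cost:", eta0)
print("final policy:", policy, "average cost:", eta)
```

In this finite setting the stopping condition means that the returned pair (eta, h) satisfies the average cost optimality equation min_a [c(x, a) + sum_y P_a(x, y) h(y)] = eta + h(x) up to the tolerance. The paper's contribution concerns when the analogous iteration remains well defined and convergent on a general state space with unbounded cost, which this sketch does not address.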
