Optimization of Average Rewards of Time Nonhomogeneous Markov Chains

We study the optimization of average rewards of discrete-time nonhomogeneous Markov chains, in which the state spaces, transition probabilities, and reward functions depend on time. The analysis encounters a few major difficulties: 1) notions crucial to homogeneous Markov chains, such as ergodicity, stationarity, periodicity, and connectivity, no longer apply; 2) the average-reward criterion is under-selective, i.e., it does not depend on the decisions in any finite period, so the problem is not amenable to dynamic programming; and 3) because of this under-selectivity, an optimal average-reward policy may not be the best in any finite period. We resolve these issues as follows: 1) we discover that a new notion, called “confluencity”, is the basis for the optimization of average rewards of Markov chains; confluencity refers to the property that two independent sample paths of a Markov chain starting from any two different initial states will eventually meet; 2) we apply the direct-comparison based approach [3] to the average-reward optimization and obtain necessary and sufficient conditions for optimal policies; and 3) we study bias optimality, with the bias measuring the transient reward, and show that for the transient reward to be optimal, one additional condition based on bias potentials is required.
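
To make confluencity concrete, here is a minimal Python simulation sketch; the two-state chain, the function names, and the time-dependent transition matrices P(t) are illustrative assumptions, not taken from the paper. It runs two independent sample paths of a time-nonhomogeneous chain from different initial states and records the first time they occupy the same state.

import numpy as np

rng = np.random.default_rng(0)

def P(t):
    # Hypothetical time-dependent transition matrix on states {0, 1}.
    # eps decays like 1/t but its sum over t diverges, so the rows stay
    # strictly positive and two independent copies of the chain still
    # meet almost surely (confluencity holds for this toy chain).
    eps = 0.5 / (2 + 0.01 * t)
    return np.array([[1 - eps, eps],
                     [eps, 1 - eps]])

def step(state, t):
    # Sample the next state at time t from row `state` of P(t).
    return rng.choice(2, p=P(t)[state])

def first_meeting_time(x0, y0, horizon=100000):
    # Run two independent sample paths from x0 and y0; return the first
    # time they occupy the same state (None if they do not meet within
    # the horizon).
    x, y = x0, y0
    for t in range(horizon):
        if x == y:
            return t
        x, y = step(x, t), step(y, t)
    return None

times = [first_meeting_time(0, 1) for _ in range(1000)]
met = [t for t in times if t is not None]
print("fraction met:", len(met) / len(times))
print("mean meeting time:", np.mean(met))

For a chain without confluencity, first_meeting_time could return None for some pairs of initial states no matter how large the horizon is chosen.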

[1] M. Bartlett et al., "Weak ergodicity in non-homogeneous Markov chains," Mathematical Proceedings of the Cambridge Philosophical Society, 1958.

[2] K. Hinderer et al., Foundations of Non-stationary Dynamic Programming with Discrete Time Parameter, 1970.

[3] D. Griffeath, "Uniform coupling of non-homogeneous Markov chains," Journal of Applied Probability, 1975.

[4] D. Griffeath, "A maximal coupling for Markov chains," 1975.

[5] J. Pitman, "On coupling of Markov chains," 1976.

[6] Robert L. Smith et al., "A New Optimality Criterion for Nonhomogeneous Markov Decision Processes," Operations Research, 1987.

[7] J. C. Bean et al., "Denumerable state nonhomogeneous Markov decision processes," 1990.

[8] Yunsun Park, James C. Bean, and Robert L. Smith, "Optimal average value convergence in nonhomogeneous Markov decision processes," 1993.

[9] Martin L. Puterman et al., Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1994.

[10] I. Sonin, "The asymptotic behaviour of a general finite nonhomogeneous Markov chain (the decomposition-separation theorem)," 1996.

[11] W. Fleming, "Book Review: Discrete-time Markov control processes: Basic optimality criteria," 1997.

[12] Xi-Ren Cao et al., "Perturbation realization, potentials, and sensitivity analysis of Markov processes," IEEE Transactions on Automatic Control, 1997.

[13] N. Limnios et al., "Hitting time in a finite non-homogeneous Markov chain with applications," 1998.

[14] Martin L. Puterman et al., "A probabilistic analysis of bias optimality in unichain Markov decision processes," IEEE Transactions on Automatic Control, 2001.

[15] Xi-Ren Cao et al., "Bias Optimality for Multichain Markov Decision Processes," 2005.

[16] Xi-Ren Cao et al., "Event-Based Optimization of Markov Systems," IEEE Transactions on Automatic Control, 2008.

[17] Xi-Ren Cao et al., "The nth-Order Bias Optimality for Multichain Markov Decision Processes," IEEE Transactions on Automatic Control, 2008.

[18] Xi-Ren Cao et al., Stochastic Learning and Optimization: A Sensitivity-Based Approach, Annual Reviews in Control, 2007.

[19] L. Saloff-Coste et al., "Merging and stability for time inhomogeneous finite Markov chains," arXiv:1004.2296, 2010.

[20] Li Qiu et al., "Partial-Information State-Based Optimization of Partially Observable Markov Decision Processes and the Separation Principle," IEEE Transactions on Automatic Control, 2014.