Optimization of Average Rewards of Time Nonhomogeneous Markov Chains

We study the optimization of average rewards of discrete-time nonhomogeneous Markov chains, in which the state spaces, transition probabilities, and reward functions depend on time. The analysis encounters a few major difficulties: 1) notions crucial to homogeneous Markov chains, such as ergodicity, stationarity, periodicity, and connectivity, no longer apply; 2) the average reward criterion is under-selective, i.e., it does not depend on the decisions in any finite period, so the problem is not amenable to dynamic programming; and 3) because of the under-selectivity, an optimal average-reward policy may not be the best in any finite period. These issues are resolved as follows: 1) we discover that a new notion, called "confluencity", is the basis for the optimization of average rewards of Markov chains; confluencity refers to the property that two independent sample paths of a Markov chain starting from any two different initial states will eventually meet; 2) we apply the direct-comparison based approach [3] to the average reward optimization and obtain necessary and sufficient conditions for optimal policies; and 3) we study bias optimality, with the bias measuring the transient reward, and show that for the transient reward to be optimal, one additional condition based on bias potentials is required.
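For intuition, the average reward criterion referred to above is typically $\eta = \limsup_{T \to \infty} \frac{1}{T} E[\sum_{t=0}^{T-1} r_t(X_t)]$; the rewards of any finite period contribute only $O(1/T)$ to this limit and thus vanish from the criterion, which is the under-selectivity. To make confluencity concrete, below is a minimal simulation sketch: two independent sample paths of a time-nonhomogeneous chain, started from different states, are run until they first occupy the same state. The three-state chain, its time-varying transition matrix P(t), and all function names are illustrative assumptions, not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
N = 3  # number of states in the toy chain


def P(t: int) -> np.ndarray:
    """Hypothetical time-dependent transition matrix (rows sum to 1)."""
    eps = 0.1 + 0.2 * np.sin(t / 5.0) ** 2  # mixing strength varies with t
    return (1.0 - eps) * np.eye(N) + eps / N * np.ones((N, N))


def step(state: int, t: int) -> int:
    """Sample the state at time t+1 given the state at time t."""
    return int(rng.choice(N, p=P(t)[state]))


def meeting_time(x0: int, y0: int, t_max: int = 10_000) -> int:
    """First time two independent paths, started at x0 and y0, coincide."""
    x, y = x0, y0
    for t in range(t_max):
        if x == y:
            return t
        x, y = step(x, t), step(y, t)
    return -1  # did not meet within t_max steps


times = [meeting_time(0, N - 1) for _ in range(1_000)]
print("all runs met:", all(t >= 0 for t in times))
print("mean meeting time:", np.mean(times))
```

Because every entry of P(t) in this toy chain is bounded away from zero, the two paths meet with probability one. Intuitively, once the paths from any two initial states merge, their future rewards coincide, so the long-run average reward does not depend on the initial state; this is the sense in which confluencity supports comparing policies by their average rewards.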

[1] Xi-Ren Cao, et al. State Classification of Time-Nonhomogeneous Markov Chains and Average Reward Optimization of Multi-Chains, 2016, IEEE Transactions on Automatic Control.

[2] L. Saloff-Coste, et al. Merging and stability for time inhomogeneous finite Markov chains, 2010, arXiv:1004.2296.

[3] Xi-Ren Cao, et al. Event-Based Optimization of Markov Systems, 2008, IEEE Transactions on Automatic Control.

[4] Xi-Ren Cao, et al. The $n$th-Order Bias Optimality for Multichain Markov Decision Processes, 2008, IEEE Transactions on Automatic Control.

[5] Xi-Ren Cao, et al. Stochastic Learning and Optimization: A Sensitivity-Based Approach, 2007, Annual Reviews in Control.

[6] Martin L. Puterman, et al. A probabilistic analysis of bias optimality in unichain Markov decision processes, 2001, IEEE Transactions on Automatic Control.

[7] O. Hernández-Lerma, et al. Discrete-Time Markov Control Processes, 1999.

[8] N. Limnios, et al. Hitting time in a finite non-homogeneous Markov chain with applications, 1998.

[9] I. Sonin. The asymptotic behaviour of a general finite nonhomogeneous Markov chain (the decomposition-separation theorem), 1996.

[10] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1994.

[11] Yunsun Park, James C. Bean, and Robert L. Smith. Optimal average value convergence in nonhomogeneous Markov decision processes, 1993.

[12] J. C. Bean, et al. Denumerable state nonhomogeneous Markov decision processes, 1990.

[13] Robert L. Smith, et al. A New Optimality Criterion for Nonhomogeneous Markov Decision Processes, 1987, Operations Research.

[14] J. Pitman. On coupling of Markov chains, 1976.

[15] D. Griffeath. Uniform coupling of non-homogeneous Markov chains, 1975, Journal of Applied Probability.

[16] D. Griffeath. A maximal coupling for Markov chains, 1975.

[17] K. Hinderer. Foundations of Non-stationary Dynamic Programming with Discrete Time Parameter, 1970.

[18] M. Bartlett, et al. Weak ergodicity in non-homogeneous Markov chains, 1958, Mathematical Proceedings of the Cambridge Philosophical Society.