On the Empirical State-Action Frequencies in Markov Decision Processes Under General Policies

We consider the empirical state-action frequencies and the empirical reward in weakly communicating finite-state Markov decision processes under general policies. We define a certain polytope and establish that every element of this polytope is the limit of the empirical frequency vector, under some policy, in a strong sense. Furthermore, we show that the probability of exceeding a given distance between the empirical frequency vector and the polytope decays exponentially with time under every policy. We provide similar results for vector-valued empirical rewards.
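The empirical state-action frequency vector referred to above can be illustrated concretely. The following is a minimal sketch, not taken from the paper: it simulates a small hypothetical two-state, two-action MDP under a stationary policy and returns the vector f_T(s, a) = (number of times (s, a) occurs in T steps) / T. The transition kernel `P`, the policy, and all parameter values are illustrative assumptions.

```python
import random
from collections import Counter

def empirical_frequencies(P, policy, s0, T, seed=0):
    """Simulate an MDP for T steps and return the empirical
    state-action frequency vector f_T(s, a) = visits(s, a) / T.

    P maps (state, action) to a next-state probability vector;
    policy maps (state, rng) to an action.
    """
    rng = random.Random(seed)
    s = s0
    counts = Counter()
    for _ in range(T):
        a = policy(s, rng)
        counts[(s, a)] += 1
        probs = P[(s, a)]  # next-state distribution for this pair
        s = rng.choices(range(len(probs)), weights=probs)[0]
    return {sa: c / T for sa, c in counts.items()}

# Hypothetical 2-state, 2-action MDP (kernel chosen only for illustration).
P = {
    (0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],
    (1, 0): [0.5, 0.5], (1, 1): [0.1, 0.9],
}
uniform = lambda s, rng: rng.randrange(2)  # stationary uniform policy

f = empirical_frequencies(P, uniform, s0=0, T=100_000)
print(sum(f.values()))  # the frequencies form a probability vector
```

For a stationary policy on a communicating chain, f_T converges to a point of the limiting polytope; the paper's contribution concerns which points are attainable under general (history-dependent) policies and how fast deviations from the polytope decay.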
