We consider a class of discrete-time, dynamic decision-making models which we refer to as Periodically Time-Inhomogeneous Markov Decision Processes (PTMDPs). In these models, the decision-making horizon can be partitioned into intervals, called slow scale cycles, of N+1 epochs each. The transition law and reward function are time-homogeneous over the first N epochs of each slow scale cycle, but distinct at the final epoch. The motivation for such models comes from applications where decisions of a different nature are taken at different time scales, i.e., many "low-level" decisions are made between less frequent "high-level" ones. For the PTMDP model, we consider the problem of optimizing the expected discounted reward when rewards are devalued by a discount factor at the beginning of each slow scale cycle. When N is large, initially stationary policies (i.s.p.'s) are natural candidates for optimal policies. Similar to turnpike policies, an initially stationary policy uses the same decision rule for some large number of epochs in each slow scale cycle, followed by a relatively short planning horizon of time-varying decision rules. In this paper, we characterize the form of the optimal value as a function of N, establish conditions ensuring the existence of near-optimal i.s.p.'s, and characterize their structure. Our analysis treats separately the cases in which the time-homogeneous part of the system has state-dependent and state-independent optimal average reward. As we illustrate, the results in these two cases are qualitatively different.
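To make the cycle structure concrete, the following is a minimal value-iteration sketch for a finite PTMDP, under illustrative notation not taken from the paper: a transition kernel P and reward r for the N homogeneous epochs, a distinct kernel Q and reward s for the final epoch of each cycle, and a discount factor alpha applied at the start of each cycle. Backward induction over one cycle then gives the fixed-point form V = T_h^N(T_f(alpha V)), where T_h and T_f are the Bellman operators of the homogeneous and final epochs; the paper's exact formulation may differ.

```python
import numpy as np

def bellman(W, trans, reward):
    """One Bellman backup: (T W)(x) = max_a [ reward(a,x) + sum_y trans(a,x,y) W(y) ].

    trans has shape (A, S, S); reward has shape (A, S); W has shape (S,)."""
    return (reward + trans @ W).max(axis=0)

def ptmdp_value_iteration(P, r, Q, s, N, alpha, tol=1e-8, max_iter=10_000):
    """Iterate the composite cycle operator V -> T_h^N(T_f(alpha V)).

    Rewards are devalued by alpha only at cycle boundaries, so the composite
    operator is a sup-norm contraction with modulus alpha and iteration converges."""
    V = np.zeros(P.shape[1])
    for _ in range(max_iter):
        W = bellman(alpha * V, Q, s)   # distinct final epoch of the cycle
        for _ in range(N):             # N homogeneous epochs, backward in time
            W = bellman(W, P, r)
        if np.max(np.abs(W - V)) < tol:
            return W
        V = W
    return V

# Toy instance: 2 states, 2 actions, N = 50 homogeneous epochs per cycle.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(2), size=(2, 2))  # shape (A, S, S), rows sum to 1
Q = rng.dirichlet(np.ones(2), size=(2, 2))
r = rng.uniform(size=(2, 2))                # shape (A, S)
s = rng.uniform(size=(2, 2))
print(ptmdp_value_iteration(P, r, Q, s, N=50, alpha=0.9))
```

Because there is no discounting within a cycle, a large N makes the within-cycle behavior resemble an undiscounted average-reward problem, which is what makes initially stationary policies natural candidates.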