Multitime scale Markov decision processes

This paper proposes a simple analytical model, called M time scale Markov decision processes (MMDPs), for hierarchically structured sequential decision making, where decisions at each level of the M-level hierarchy are made on M different discrete time scales. In this model, the state space and control space of each level in the hierarchy are disjoint from those of the other levels, and the hierarchy is structured in a "pyramid" sense: a decision and/or the state at level m (slower time scale) affects the evolution of the decision making process at the lower level m+1 (faster time scale) until a new decision is made at the higher level, but the lower level decisions themselves do not affect the transition dynamics of the higher levels. The performance produced by the lower level decisions does, however, feed back into the higher level decisions. A hierarchical objective function is defined such that the finite-horizon value of following a (nonstationary) policy at level m+1 over a decision epoch of level m, plus an immediate reward at level m, constitutes the single-step reward for the decision making process at level m. From this we define a "multi-level optimal value function" and derive a "multi-level optimality equation." We discuss how to solve MMDPs exactly and study some approximation methods, along with heuristic sampling-based schemes, for solving MMDPs.
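To make the hierarchical objective concrete, here is a minimal sketch of the recursion under assumed notation that is not taken from the paper itself: x_m and a_m denote the level-m state and decision, each level-m decision epoch is assumed to span T steps of the faster level m+1, and \pi_{m+1} = (\pi_{m+1,0}, ..., \pi_{m+1,T-1}) is a nonstationary level-(m+1) policy followed over that epoch.

    % Effective single-step reward at level m (assumed notation): the
    % immediate level-m reward plus the finite-horizon value of running
    % the level-(m+1) policy over the T fast steps of one slow epoch.
    R_m(x_m, a_m, \pi_{m+1})
      = r_m(x_m, a_m)
      + \mathbb{E}\!\left[ \sum_{t=0}^{T-1}
          r_{m+1}\bigl(x_{m+1,t}, \pi_{m+1,t}(x_{m+1,t})\bigr)
          \,\middle|\, x_m, a_m \right]

    % Multi-level optimality equation at level m: optimize jointly over
    % the slow decision and the fast-level policy for the epoch; the
    % next slow state x_m' depends only on (x_m, a_m).
    V_m^*(x_m)
      = \max_{a_m,\; \pi_{m+1}}
        \Bigl\{ R_m(x_m, a_m, \pi_{m+1})
          + \gamma\, \mathbb{E}\bigl[ V_m^*(x_m') \mid x_m, a_m \bigr] \Bigr\}

Note that the expectation over x_m' is conditioned only on the level-m state and decision, consistent with the pyramid structure described above: the fast-level policy influences the reward earned within an epoch but not the slow-level transition dynamics.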
