A two-phase time aggregation algorithm for average cost Markov decision processes

This paper introduces a two-phase approach, based on state space embedding (time aggregation), for solving average cost Markov decision processes. In the first phase, time aggregation is applied to evaluate the current policy on a prescribed subset of the state space, and a novel result is used to extend this evaluation to the whole state space. The second phase uses the extended evaluation in a policy improvement step. The two phases are applied in alternation until convergence is attained or a prescribed running time is exceeded.
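To make the two-phase loop concrete, the following is a minimal Python sketch for a small finite MDP with known transition matrices. The names (P_a, c_a, S0, evaluate_policy_time_aggregated) and the specific embedded-chain formulas are illustrative assumptions, not the paper's notation or exact algorithm: phase one evaluates the current policy on the subset S0 through the embedded (time-aggregated) chain and then expands the relative values to the remaining states via the Poisson equation, and phase two performs a standard policy improvement step over the whole state space.

```python
# Minimal sketch of a two-phase (time-aggregated) policy iteration for an
# average cost MDP.  Assumes a finite, unichain MDP with known transition
# matrices; names and formulas are illustrative, not the paper's.
import numpy as np

def evaluate_policy_time_aggregated(P, c, S0, ref=0):
    """Phase 1: evaluate a fixed policy.

    P  : (n, n) transition matrix under the fixed policy.
    c  : (n,)   per-stage costs under the fixed policy.
    S0 : indices of the prescribed subset of the state space.

    Returns the average cost eta and a relative value function h on the
    whole state space: the aggregated Poisson equation is solved on S0 and
    the solution is then expanded to the remaining states.
    """
    n = P.shape[0]
    Sc = np.array([s for s in range(n) if s not in set(S0)])
    S0 = np.asarray(S0)

    P00, P0c = P[np.ix_(S0, S0)], P[np.ix_(S0, Sc)]
    Pc0, Pcc = P[np.ix_(Sc, S0)], P[np.ix_(Sc, Sc)]

    # Excursions outside S0: expected sojourn time and accumulated cost
    # until the process returns to S0.
    M = np.linalg.inv(np.eye(len(Sc)) - Pcc)      # fundamental matrix
    T_exc = M @ np.ones(len(Sc))                  # expected excursion length
    C_exc = M @ c[Sc]                             # expected excursion cost

    # Embedded (time-aggregated) chain on S0.
    Q   = P00 + P0c @ M @ Pc0                     # embedded transitions
    tau = 1.0 + P0c @ T_exc                       # expected time per embedded step
    K   = c[S0] + P0c @ C_exc                     # expected cost per embedded step

    # Stationary distribution of the embedded chain.
    A = np.vstack([Q.T - np.eye(len(S0)), np.ones(len(S0))])
    b = np.zeros(len(S0) + 1)
    b[-1] = 1.0
    mu, *_ = np.linalg.lstsq(A, b, rcond=None)

    eta = (mu @ K) / (mu @ tau)                   # long-run average cost

    # Aggregated Poisson equation on S0 (relative values, pinned at `ref`).
    h0, *_ = np.linalg.lstsq(np.eye(len(S0)) - Q, K - eta * tau, rcond=None)
    h0 -= h0[ref]

    # Expand the evaluation to the states outside S0.
    hc = M @ (c[Sc] - eta + Pc0 @ h0)

    h = np.empty(n)
    h[S0], h[Sc] = h0, hc
    return eta, h

def two_phase_policy_iteration(P_a, c_a, S0, max_iters=100):
    """Alternate Phase 1 (evaluation) and Phase 2 (improvement).

    P_a : (n_actions, n, n) transition matrices, one per action.
    c_a : (n, n_actions)    per-stage costs.
    """
    n = P_a.shape[1]
    policy = np.zeros(n, dtype=int)
    eta = np.inf
    for _ in range(max_iters):
        P = P_a[policy, np.arange(n), :]          # rows of P under current policy
        c = c_a[np.arange(n), policy]
        eta, h = evaluate_policy_time_aggregated(P, c, S0)
        # Phase 2: one-step policy improvement over the whole state space.
        q = c_a + np.einsum('aij,j->ia', P_a, h)
        new_policy = q.argmin(axis=1)
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy
    return policy, eta
```

In this sketch, convergence of the policy plays the role of the stopping criterion; a running-time budget, as mentioned in the abstract, could equally well be used to terminate the loop.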
