New approximate dynamic programming algorithms for large-scale undiscounted Markov decision processes and their application to optimize a production and distribution system

Undiscounted Markov decision processes (UMDPs) can formulate optimal stochastic control problems that minimize the expected total cost per period for a wide range of systems. We propose new approximate dynamic programming (ADP) algorithms for large-scale UMDPs that can overcome the curses of dimensionality. These algorithms, called simulation-based modified policy iteration (SBMPI) algorithms, extend the simulation-based modified policy iteration method (SBMPIM) (Ohno, 2011) for optimal control problems of multistage JIT-based production and distribution systems with stochastic demand and production capacity. The main new ideas of the SBMPI algorithms are that the simulation-based policy evaluation step of the SBMPIM is replaced by the partial policy evaluation step of the modified policy iteration method (MPIM), and that the algorithms start from estimates of the expected total cost per period and the relative values obtained by simulating the system under a reasonable initial policy.
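To make the partial-evaluation idea concrete, the following is a minimal tabular sketch (not the authors' large-scale implementation) of modified policy iteration under the average-cost criterion, started from a gain g and relative values h estimated by simulating the system under an initial policy, as the abstract describes. The explicit transition matrices P[a] and cost vectors c[a], the regenerative-style estimator, and all function names are illustrative assumptions.

```python
# Minimal sketch with assumed names and a small tabular model (not the
# authors' implementation): modified policy iteration for an average-cost MDP,
# initialized by simulation under an initial policy.
import numpy as np


def simulate_initial_estimates(P, c, policy, horizon=20000, ref_state=0, seed=0):
    """Crude simulation-based estimates of the average cost per period g and
    the relative values h under a fixed policy, using cycles that regenerate
    at ref_state (h[ref_state] is anchored near 0)."""
    rng = np.random.default_rng(seed)
    n = P[0].shape[0]

    # Pass 1: estimate g as the long-run average one-period cost.
    s, costs = ref_state, []
    for _ in range(horizon):
        a = policy[s]
        costs.append(c[a][s])
        s = rng.choice(n, p=P[a][s])
    g = float(np.mean(costs))

    # Pass 2: estimate h[s] as the mean accumulated (cost - g) from the first
    # visit to s in a cycle until the next return to ref_state.
    h_sum, h_cnt = np.zeros(n), np.zeros(n)
    s, pending = ref_state, {}
    for _ in range(horizon):
        a = policy[s]
        pending.setdefault(s, 0.0)
        step = c[a][s] - g
        for k in pending:
            pending[k] += step
        s = rng.choice(n, p=P[a][s])
        if s == ref_state:                      # cycle complete: record and reset
            for k, v in pending.items():
                h_sum[k] += v
                h_cnt[k] += 1
            pending = {}
    return g, np.divide(h_sum, np.maximum(h_cnt, 1.0))


def average_cost_mpi(P, c, policy0, m=20, max_iters=100, ref_state=0):
    """Modified policy iteration (partial evaluation with m sweeps) for the
    average-cost criterion, initialized from the simulated estimates above."""
    n, actions = P[0].shape[0], range(len(P))
    g, h = simulate_initial_estimates(P, c, policy0, ref_state=ref_state)
    policy = np.asarray(policy0).copy()
    for _ in range(max_iters):
        # Partial policy evaluation: m relative-value sweeps for the fixed policy.
        Ppi = np.array([P[policy[s]][s] for s in range(n)])
        cpi = np.array([c[policy[s]][s] for s in range(n)])
        for _ in range(m):
            h = cpi + Ppi @ h
            g = h[ref_state]                    # updated gain estimate
            h = h - g                           # keep h anchored at ref_state
        # Policy improvement: one-step greedy lookahead on the relative values.
        Q = np.array([c[a] + P[a] @ h for a in actions])    # |A| x |S|
        new_policy = np.argmin(Q, axis=0)
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy
    return policy, g, h
```

The sketch only illustrates the control flow of this class of algorithms, namely simulation-based initialization, m partial policy evaluation sweeps, and greedy policy improvement; for the large-scale systems targeted in the paper, the explicit expectations over the transition matrices would not be enumerable and are handled by simulation and approximation instead.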

[1] R. Bellman, Dynamic Programming, 1957.

[2] Katsuhisa Ohno, An optimal control of a production and distribution system, 2009.

[3] S. Marcus, et al., A Simulation-Based Policy Iteration Algorithm for Average Cost Unichain Markov Decision Processes, 2000.

[4] Yves Dallery, et al., Extended kanban control system: combining kanban and base stock, 2000.

[5] S. Mahadevan, et al., Solving Semi-Markov Decision Problems Using Average Reward Reinforcement Learning, 1999.

[6] Edwin K. P. Chong, et al., Approximate dynamic programming for an inventory problem: Empirical comparison, 2011, Comput. Ind. Eng.

[7] William L. Cooper, et al., Convergence of Simulation-Based Policy Iteration, 2003, Probability in the Engineering and Informational Sciences.

[8] Hyeong Soo Chang, et al., Simulation-Based Algorithms for Markov Decision Processes, 2013.

[9] Vivek F. Farias, et al., Approximate Dynamic Programming via a Smoothed Linear Program, 2009, Oper. Res.

[10] David L. Woodruff, et al., CONWIP: a pull alternative to kanban, 1990.

[11] Martin L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1994.

[12] Asbjoern M. Bonvik, et al., A comparison of production-line control mechanisms, 1997.

[13] Jan Fransoo, et al., Planning Supply Chain Operations: Definition and Comparison of Planning Concepts, 2003, Supply Chain Management.

[14] Katsuhisa Ohno, et al., The optimal control of just-in-time-based production and distribution systems and performance comparisons with optimized pull systems, 2011, Eur. J. Oper. Res.

[15] John B. Kidd, Toyota Production System, 1993.

[16] Abhijit Gosavi, Simulation-Based Optimization: Parametric Optimization Techniques and Reinforcement Learning, 2003.

[17] Bart De Schutter, et al., Reinforcement Learning and Dynamic Programming Using Function Approximators, 2010.

[18] Warren B. Powell, A review of stochastic algorithms with continuous value function approximation and some new approximate policy iteration algorithms for multidimensional continuous applications, 2011.

[19] A. Gosavi, et al., A reinforcement learning approach to a single leg airline revenue management problem with multiple fare classes and overbooking, 2002.

[20] Jennie Si, et al., Handbook of Learning and Approximate Dynamic Programming (IEEE Press Series on Computational Intelligence), 2004.

[21] John N. Tsitsiklis, et al., Neuro-Dynamic Programming, 1996.

[22] Katsuhisa Ohno, et al., Computing Optimal Policies for Controlled Tandem Queueing Systems, 1987, Oper. Res.

[23] D. Bertsekas, Approximate policy iteration: a survey and some new methods, 2011.

[24] Mitsutoshi Kojima, et al., Optimal numbers of two kinds of kanbans in a JIT production system, 1995.

[25] Andrew G. Barto, et al., Reinforcement Learning, 1998.

[26] Xi-Ren Cao, Stochastic Learning and Optimization: A Sensitivity-Based Approach, 2007, Annu. Rev. Control.

[27] Dimitri P. Bertsekas, et al., Pathologies of temporal difference methods in approximate dynamic programming, 2010, 49th IEEE Conference on Decision and Control (CDC).

[28] Fangruo Chen, et al., Optimal Policies for Multi-Echelon Inventory Problems with Batch Ordering, 2000, Oper. Res.

[29] Warren B. Powell, Approximate Dynamic Programming: Solving the Curses of Dimensionality, 2007, Wiley Series in Probability and Statistics.