Self-Improving Factory Simulation using Continuous-time Average-Reward Reinforcement Learning

Many factory optimization problems, from inventory control to scheduling and reliability, can be formulated as continuous-time Markov decision processes. A primary goal in such problems is to find a gain-optimal policy that minimizes the long-run average cost. This paper describes a new average-reward algorithm called SMART for finding gain-optimal policies in continuous-time semi-Markov decision processes. The paper presents a detailed experimental study of SMART on a large unreliable production inventory problem. SMART outperforms two well-known reliability heuristics from industrial engineering. A key feature of this study is the integration of the reinforcement learning algorithm directly into two commercial discrete-event simulation packages, ARENA and CSIM, paving the way for this approach to be applied to many other factory optimization problems for which there already exist simulation models.
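The abstract does not state the update rule, so the following is only a minimal sketch of a generic tabular average-reward update for semi-Markov decision processes, the family of methods to which SMART belongs. It assumes action values are adjusted by the immediate reward minus the estimated gain times the sojourn time, and that the gain is estimated as total reward over total time on non-exploratory steps. The two-state machine, its rewards and sojourn times, and every function and parameter name are hypothetical and are not taken from the paper. Costs are written as negative rewards, so maximizing average reward corresponds to minimizing long-run average cost.

```python
import random

HEALTHY, WORN = 0, 1        # hypothetical machine states
PRODUCE, MAINTAIN = 0, 1    # hypothetical actions


def toy_machine(state, action, rng):
    """Hypothetical simulator step: returns (next_state, reward, sojourn_time)."""
    if action == MAINTAIN:
        # Maintenance has a small cost and restores the machine.
        return HEALTHY, -1.0, rng.uniform(0.5, 1.5)
    if state == WORN and rng.random() < 0.5:
        # Producing on a worn machine may cause a costly, slow repair.
        return HEALTHY, -10.0, rng.uniform(4.0, 6.0)
    # Successful production; a healthy machine may become worn.
    next_state = WORN if (state == WORN or rng.random() < 0.3) else HEALTHY
    return next_state, 1.0, rng.uniform(0.5, 1.5)


def average_reward_smdp_learning(step, n_states, n_actions,
                                 n_steps=20000, alpha=0.1,
                                 epsilon=0.1, seed=0):
    """Tabular average-reward learning on an SMDP (illustrative sketch only)."""
    rng = random.Random(seed)
    R = [[0.0] * n_actions for _ in range(n_states)]  # action values
    rho = 0.0                                         # estimated gain
    total_reward = 0.0
    total_time = 0.0
    state = 0
    for _ in range(n_steps):
        greedy = max(range(n_actions), key=lambda a: R[state][a])
        explore = rng.random() < epsilon
        action = rng.randrange(n_actions) if explore else greedy
        next_state, reward, tau = step(state, action, rng)
        # Reward is offset by the gain accrued over the sojourn time tau.
        target = reward - rho * tau + max(R[next_state])
        R[state][action] += alpha * (target - R[state][action])
        if not explore:
            # Assumption: the gain estimate uses non-exploratory steps only.
            total_reward += reward
            total_time += tau
            rho = total_reward / total_time
        state = next_state
    return R, rho


if __name__ == "__main__":
    R, rho = average_reward_smdp_learning(toy_machine, 2, 2)
    print("estimated gain:", rho)
    print("action values:", R)
```

Because the learner only needs a function that returns the next state, the reward, and the elapsed time, the same loop could in principle wrap a step taken from an existing discrete-event simulation model, which is the kind of integration the abstract describes. This sketch makes no convergence claim; the fixed step size and exploration rate are arbitrary choices.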
