Paper information - Self-Improving Factory Simulation using Continuous-time Average-Reward Reinforcement Learning

Many factory optimization problems, from inventory control to scheduling and reliability, can be formulated as continuous-time Markov decision processes. A primary goal in such problems is to find a gain-optimal policy that minimizes the long-run average cost. This paper describes a new average-reward algorithm called SMART for finding gain-optimal policies in continuous-time semi-Markov decision processes. The paper presents a detailed experimental study of SMART on a large unreliable production inventory problem. SMART outperforms two well-known reliability heuristics from industrial engineering. A key feature of this study is the integration of the reinforcement learning algorithm directly into two commercial discrete-event simulation packages, ARENA and CSIM, paving the way for this approach to be applied to many other factory optimization problems for which there already exist simulation models.
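The abstract does not state SMART's update rule, so the following Python sketch should not be read as the paper's algorithm. It only illustrates, under stated assumptions, what "long-run average cost" means when decisions are separated by variable sojourn times: the gain is estimated as accumulated cost divided by accumulated time, and a generic tabular relative-value update subtracts that rate scaled by the sojourn time. The function names, state and action labels, step size, and numbers are all hypothetical.

```python
from collections import defaultdict


def average_cost_rate(transitions):
    """Estimate long-run average cost per unit time.

    `transitions` is an iterable of (cost, sojourn_time) pairs, one per
    decision epoch. This is the usual ratio estimate: total cost / total time.
    """
    transitions = list(transitions)
    total_cost = sum(cost for cost, _ in transitions)
    total_time = sum(tau for _, tau in transitions)
    if total_time <= 0:
        raise ValueError("total sojourn time must be positive")
    return total_cost / total_time


def relative_value_update(values, state, action, cost, tau, next_state,
                          actions, rho, alpha=0.1):
    """One tabular update of a relative action value (cost minimization).

    Assumed target (a generic average-cost form, not taken from the paper):
        cost - rho * tau + min_a values[(next_state, a)]
    """
    best_next = min(values[(next_state, a)] for a in actions)
    target = cost - rho * tau + best_next
    values[(state, action)] += alpha * (target - values[(state, action)])
    return values[(state, action)]


# Illustrative numbers only; they do not come from the paper's experiments.
log = [(4.0, 2.0), (1.0, 0.5), (5.0, 2.5)]
rho = average_cost_rate(log)

values = defaultdict(float)  # every unseen (state, action) pair starts at 0.0
new_value = relative_value_update(
    values, "s0", "produce", cost=6.0, tau=2.0, next_state="s1",
    actions=["produce", "maintain"], rho=rho,
)
```

In a simulation-based setting such as the one the abstract describes, the `(cost, sojourn_time)` pairs would be read from the simulator at each decision epoch rather than from a fixed list.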
