Look-ahead control of conveyor-serviced production station by using potential-based online policy iteration

We consider the look-ahead control of a conveyor-serviced production station (CSPS) in the context of a semi-Markov decision process (SMDP) model, with the goal of finding an optimal control policy under either the average- or the discounted-cost criterion. Policy iteration (PI), combined with the concept of performance potential, provides a unified optimisation framework for both criteria. However, the exact solution scheme suffers from a major difficulty: it requires not only full knowledge of the model parameters but also a considerable amount of work to obtain and process the necessary system and performance matrices. To overcome this difficulty, we propose a potential-based online PI algorithm in this article. During implementation, by analysing and utilising the historical information from the past operation of a practical CSPS system, the potentials and state-action values are learned online through an effective exploration scheme. Finally, we illustrate the successful application of this learning-based technique to CSPS systems with an example.
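To make the flavour of the online scheme concrete, the Python sketch below shows one way a potential-based online PI loop for an average-cost SMDP might be organised: state-action values (and hence potentials) are estimated from a single sample path under the current policy with epsilon-exploration, and the policy is then improved greedily. This is only an illustrative sketch under stated assumptions; the CSPS simulator interface (env.reset, env.step returning next state, transition cost, and sojourn time) and all numerical settings are assumptions of this example, not the algorithm as specified in the article.

    # Hedged sketch of potential-based online policy iteration for an SMDP
    # under the average-cost criterion. The environment interface and all
    # parameter values are illustrative assumptions.
    import numpy as np

    def online_policy_iteration(env, n_states, n_actions,
                                n_iterations=20, steps_per_iter=5000,
                                alpha=0.01, epsilon=0.1, seed=0):
        rng = np.random.default_rng(seed)
        policy = np.zeros(n_states, dtype=int)        # current deterministic policy
        Q = np.zeros((n_states, n_actions))           # state-action value estimates
        for _ in range(n_iterations):
            Q[:] = 0.0                                # re-evaluate the current policy
            total_cost, total_time = 0.0, 1e-12
            s = env.reset()
            # --- policy evaluation along one sample path, with exploration ---
            for _ in range(steps_per_iter):
                a = policy[s] if rng.random() > epsilon else rng.integers(n_actions)
                s_next, cost, sojourn = env.step(a)   # one SMDP transition (assumed interface)
                total_cost += cost
                total_time += sojourn
                eta = total_cost / total_time         # running average-cost-rate estimate
                # TD-style update of the Poisson-equation residual:
                # Q(s,a) ~ c(s,a) - eta*tau(s,a) + g(s'), with g(s') ~ Q(s', policy(s'))
                td_target = cost - eta * sojourn + Q[s_next, policy[s_next]]
                Q[s, a] += alpha * (td_target - Q[s, a])
                s = s_next
            # --- policy improvement: greedy w.r.t. the learned state-action values ---
            policy = Q.argmin(axis=1)
        return policy, Q

The improvement step minimises because the criterion is stated in terms of costs; a discounted-cost variant of this sketch would replace the eta-times-sojourn correction in the TD target with the appropriate sojourn-time-dependent discount factor.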
