A least squares temporal difference actor–critic algorithm with applications to warehouse management

This article develops a new approximate dynamic programming (DP) algorithm for Markov decision problems and applies it to a vehicle dispatching problem arising in warehouse management. The algorithm is of the actor–critic type and uses a least squares temporal difference (LSTD) learning method. It operates on a sample path of the system and optimizes the policy within a prespecified class characterized by a parsimonious set of parameters. The method applies to a partially observable Markov decision process (POMDP) setting in which measurements of the state variables may be corrupted and the cost is observed only through these imperfect state observations. We show that, under reasonable assumptions, the algorithm converges to a locally optimal parameter set. We also show that the imperfect cost observations do not affect the policy: the algorithm minimizes the true expected cost. In the warehouse application, the problem is to dispatch sensor-equipped forklifts so as to minimize operating costs involving product movement delays and forklift maintenance. We consider instances for which standard DP is computationally intractable. Simulation results confirm the theoretical claims and show that our algorithm converges more smoothly than earlier actor–critic algorithms while substantially outperforming heuristics used in practice.
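To make the algorithmic structure concrete, the sketch below illustrates one possible LSTD actor–critic loop in Python. It is a simplified, discounted-cost illustration, not the algorithm analyzed in the article (which treats average cost, partial observability, and corrupted cost observations): the critic fits approximate Q-values by batch LSTD on the compatible features psi = grad_theta log pi_theta(u | x), and the actor takes a gradient step that descends the estimated expected cost. The environment interface (env.reset, env.step), the feature map features(x, u), and the finite action list actions are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def policy_probs(theta, features, actions, x):
    """Boltzmann (softmax) policy over a finite action set.
    features(x, u) must return a vector with the same length as theta."""
    prefs = np.array([theta @ features(x, u) for u in actions])
    prefs -= prefs.max()                          # numerical stability
    p = np.exp(prefs)
    return p / p.sum()

def score(theta, features, actions, x, u):
    """Compatible features: psi = grad_theta log pi_theta(u | x)."""
    p = policy_probs(theta, features, actions, x)
    mean_feat = sum(pi * features(x, a) for pi, a in zip(p, actions))
    return features(x, u) - mean_feat

def lstd_actor_critic(env, features, actions, theta,
                      n_iters=100, horizon=2000, gamma=0.99, beta=0.01):
    """Discounted-cost simplification of an LSTD actor-critic loop.
    Assumed interface: env.reset() -> observation, env.step(u) -> (observation, cost).
    Critic: batch LSTD on compatible features; actor: one gradient step per batch."""
    d = len(theta)
    for _ in range(n_iters):
        A = 1e-3 * np.eye(d)                      # small ridge term keeps A invertible
        b = np.zeros(d)
        psis = []
        x = env.reset()
        u = actions[rng.choice(len(actions), p=policy_probs(theta, features, actions, x))]
        for _ in range(horizon):
            x_next, cost = env.step(u)
            u_next = actions[rng.choice(len(actions),
                                        p=policy_probs(theta, features, actions, x_next))]
            psi = score(theta, features, actions, x, u)
            psi_next = score(theta, features, actions, x_next, u_next)
            A += np.outer(psi, psi - gamma * psi_next)
            b += cost * psi
            psis.append(psi)
            x, u = x_next, u_next
        r = np.linalg.solve(A, b)                 # critic weights: Q_hat(x, u) = r @ psi
        grad = sum((r @ psi) * psi for psi in psis) / horizon
        theta = theta - beta * grad               # descent step, since costs are minimized
    return theta
```

Using the score function as the critic's feature basis is the standard compatible-features device: it keeps the critic low-dimensional while still supplying exactly the information the actor's gradient step needs. In the dispatching setting, x would stand for the (imperfectly observed) system state, u for a forklift assignment decision, and theta for the parsimonious policy parameter vector the abstract refers to.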
