Paper information - Reinforcement Learning Methods for Continuous-Time Markov Decision Problems

Semi-Markov Decision Problems are continuous time generalizations of discrete time Markov Decision Problems. A number of reinforcement learning algorithms have been developed recently for the solution of Markov Decision Problems, based on the ideas of asynchronous dynamic programming and stochastic approximation. Among these are TD(λ), Q-learning, and Real-time Dynamic Programming. After reviewing semi-Markov Decision Problems and Bellman's optimality equation in that context, we propose algorithms similar to those named above, adapted to the solution of semi-Markov Decision Problems. We demonstrate these algorithms by applying them to the problem of determining the optimal control for a simple queueing system. We conclude with a discussion of circumstances under which these algorithms may be usefully applied.
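The abstract does not state an update rule, so the following is only a minimal sketch of how a Q-learning-style update might be adapted to a semi-Markov decision problem. It assumes continuous discounting at a rate `beta` and a reward that accrues at a constant rate over the random sojourn time between decisions. The function name `smdp_q_update`, its parameters, and the state and action labels are hypothetical and chosen for illustration.

```python
import math
from collections import defaultdict


def smdp_q_update(q, state, action, reward_rate, sojourn_time,
                  next_state, actions, alpha=0.1, beta=0.1):
    """Apply one Q-learning-style update for a semi-Markov decision problem.

    Assumptions (not taken from the abstract): rewards accrue at the constant
    rate `reward_rate` during the sojourn, and future value is discounted
    continuously at rate `beta`.
    """
    # Discount factor applied to value received after the sojourn ends.
    discount = math.exp(-beta * sojourn_time)
    # Integral of reward_rate * exp(-beta * t) for t from 0 to sojourn_time.
    accrued = (1.0 - discount) / beta * reward_rate
    # Greedy value estimate of the state reached at the next decision epoch.
    best_next = max(q[(next_state, a)] for a in actions)
    target = accrued + discount * best_next
    # Stochastic-approximation step toward the sampled target.
    q[(state, action)] += alpha * (target - q[(state, action)])
    return q[(state, action)]


# Hypothetical queueing-style transition: a cost of 1 per unit time for 2 time units.
q = defaultdict(float)
new_value = smdp_q_update(q, "s0", "serve", reward_rate=-1.0, sojourn_time=2.0,
                          next_state="s1", actions=["serve", "idle"])
```

In the discrete-time case the sojourn time is always one step, so the discount factor is a constant; here it varies with each sampled sojourn time, which is the main difference this sketch is meant to show.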

Michael O. Duff | Steven J. Bradtke

[1] E. Denardo. Contraction Mappings in the Theory Underlying Dynamic Programming, 1967.

[2] B. Hajek. Optimal control of two interacting service stations, 1982, 21st IEEE Conference on Decision and Control.

[3] Dimitri P. Bertsekas, et al. Dynamic Programming: Deterministic and Stochastic Models, 1987.

[4] C. Watkins. Learning from delayed rewards, 1989.

[5] John Moody, et al. Learning rate schedules for faster stochastic gradient search, 1992, Neural Networks for Signal Processing II: Proceedings of the 1992 IEEE Workshop.

[6] John N. Tsitsiklis, et al. Asynchronous stochastic approximation and Q-learning, 1993, Proceedings of 32nd IEEE Conference on Decision and Control.

[7] Michael I. Jordan, et al. MASSACHUSETTS INSTITUTE OF TECHNOLOGY ARTIFICIAL INTELLIGENCE LABORATORY and CENTER FOR BIOLOGICAL AND COMPUTATIONAL LEARNING DEPARTMENT OF BRAIN AND COGNITIVE SCIENCES, 1996.

[8] P. Dayan, et al. TD(λ) converges with probability 1, 1994, Machine Learning.

[9] Steven J. Bradtke, et al. Incremental dynamic programming for on-line adaptive optimal control, 1995.

[10] Andrew G. Barto, et al. Learning to Act Using Real-Time Dynamic Programming, 1995, Artif. Intell.