Actor-Critic-Type Learning Algorithms for Markov Decision Processes

Algorithms for learning the optimal policy of a Markov decision process (MDP) from simulated transitions are formulated and analyzed. These are variants of the well-known "actor-critic" (or "adaptive critic") algorithm from the artificial intelligence literature. Distributed asynchronous implementations are considered. The analysis relies on two-time-scale stochastic approximation.
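To make the two-time-scale structure concrete, the following is a minimal sketch of a tabular actor-critic driven by simulated transitions, not the specific recursions analyzed in the paper. The `sample_transition` interface, the discounted-cost criterion, the softmax (Gibbs) policy parameterization, and the step-size exponents are all illustrative assumptions; the essential point is that the critic's step sizes decay more slowly than the actor's, so the critic effectively equilibrates for a quasi-static policy while the actor sees a converged critic.

```python
import numpy as np

def actor_critic(sample_transition, n_states, n_actions,
                 n_steps=100_000, gamma=0.95, seed=0):
    """Illustrative tabular actor-critic on a simulated MDP.

    `sample_transition(s, a)` is assumed to return (next_state, cost)
    drawn from the MDP's transition law. All modelling choices here
    (discounted cost, softmax policy, step-size schedules) are
    assumptions for the sketch, not the paper's algorithm.
    """
    rng = np.random.default_rng(seed)
    V = np.zeros(n_states)                    # critic: value estimates
    theta = np.zeros((n_states, n_actions))   # actor: policy parameters

    s = 0
    for n in range(1, n_steps + 1):
        # Two time scales: b_n / a_n -> 0, so the actor (step b_n)
        # moves on a slower time scale than the critic (step a_n).
        a_n = 1.0 / (n ** 0.6)   # faster time scale (critic)
        b_n = 1.0 / n            # slower time scale (actor)

        # Sample an action from the softmax (Gibbs) policy at state s.
        logits = theta[s] - theta[s].max()
        probs = np.exp(logits) / np.exp(logits).sum()
        a = rng.choice(n_actions, p=probs)

        s_next, cost = sample_transition(s, a)

        # Critic: temporal-difference update of the value estimate.
        td_error = cost + gamma * V[s_next] - V[s]
        V[s] += a_n * td_error

        # Actor: since costs are minimized, a negative TD error
        # reinforces the chosen action (its parameter is increased).
        theta[s, a] -= b_n * td_error

        s = s_next

    return V, theta
```

In a distributed asynchronous implementation, different components of `V` and `theta` would be updated by different processors with their own local clocks and possibly delayed information; the sketch above keeps a single synchronous loop for readability.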
