A MATHEMATICAL ANALYSIS OF ACTOR-CRITIC ARCHITECTURES FOR LEARNING OPTIMAL CONTROLS THROUGH INCREMENTAL DYNAMIC PROGRAMMING

Combining elements of the theory of dynamic programming with features appropriate for on-line learning has led to an approach Watkins has called incremental dynamic programming. Here we adopt this point of view and obtain some preliminary mathematical results relevant to understanding the capabilities and limitations of actor-critic learning systems. Examples of such systems are Samuel's learning checker player, Holland's bucket brigade algorithm, Witten's adaptive controller, and the adaptive heuristic critic algorithm of Barto, Sutton, and Anderson. Particular emphasis here is on the effect of complete asynchrony in the updating of the actor and the critic across individual states or state-action pairs. The main results are that, while convergence to optimal performance is not guaranteed in general, there are a number of situations in which such convergence is assured.
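
To make the actor-critic setting concrete, the following is a minimal illustrative sketch, not the formulation analyzed in the paper, of a tabular actor-critic in which the critic's state values and the actor's action preferences are updated asynchronously, one state or state-action pair at a time. The small Markov decision process P, the step sizes alpha and beta, and the preference table pref are assumptions introduced only for this example.

    # Illustrative sketch only: tabular actor-critic with asynchronous
    # per-state / per-state-action updates on a hypothetical 3-state MDP.
    import random

    # Hypothetical deterministic MDP: P[state][action] = (next_state, reward)
    P = {0: {0: (1, 0.0), 1: (2, 1.0)},
         1: {0: (0, 0.0), 1: (2, 2.0)},
         2: {0: (2, 0.0), 1: (2, 0.0)}}
    gamma, alpha, beta = 0.9, 0.1, 0.1          # discount and step sizes (assumed)

    V = {s: 0.0 for s in P}                     # critic: estimated state values
    pref = {s: {a: 0.0 for a in P[s]} for s in P}  # actor: action preferences

    for t in range(10000):
        s = random.choice(list(P))              # states visited in arbitrary order
        a = random.choice(list(P[s]))           # exploratory action choice
        s2, r = P[s][a]
        delta = r + gamma * V[s2] - V[s]        # temporal-difference error
        if random.random() < 0.5:               # asynchrony: either component
            V[s] += alpha * delta               # ...critic update for state s
        else:
            pref[s][a] += beta * delta          # ...actor update for pair (s, a)

    # Greedy policy implied by the learned preferences
    print({s: max(pref[s], key=pref[s].get) for s in P})

The deliberately arbitrary interleaving of critic and actor updates across individual states mirrors, in spirit, the complete asynchrony whose consequences the paper examines.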