A MATHEMATICAL ANALYSIS OF ACTOR-CRITIC ARCHITECTURES FOR LEARNING OPTIMAL CONTROLS THROUGH INCREMENTAL DYNAMIC PROGRAMMING

Combining elements of the theory of dynamic programming with features appropriate for on-line learning has led to an approach Watkins has called incremental dynamic programming. Here we adopt this point of view and obtain some preliminary mathematical results relevant to understanding the capabilities and limitations of actor-critic learning systems. Examples of such systems are Samuel's learning checker player, Holland's bucket brigade algorithm, Witten's adaptive controller, and the adaptive heuristic critic algorithm of Barto, Sutton, and Anderson. Particular emphasis here is on the effect of complete asynchrony in the updating of the actor and the critic across individual states or state-action pairs. The main results are that, while convergence to optimal performance is not guaranteed in general, there are a number of situations in which such convergence is assured.
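
To make the actor-critic setting concrete, the following is a minimal illustrative sketch, not the formulation analyzed in the paper, of a tabular actor-critic in which the critic's state values and the actor's action preferences are updated asynchronously, one state or state-action pair at a time. The small Markov decision process P, the step sizes alpha and beta, and the preference table pref are assumptions introduced only for this example.

    # Illustrative sketch only: tabular actor-critic with asynchronous
    # per-state / per-state-action updates on a hypothetical 3-state MDP.
    import random

    # Hypothetical deterministic MDP: P[state][action] = (next_state, reward)
    P = {0: {0: (1, 0.0), 1: (2, 1.0)},
         1: {0: (0, 0.0), 1: (2, 2.0)},
         2: {0: (2, 0.0), 1: (2, 0.0)}}
    gamma, alpha, beta = 0.9, 0.1, 0.1          # discount and step sizes (assumed)

    V = {s: 0.0 for s in P}                     # critic: estimated state values
    pref = {s: {a: 0.0 for a in P[s]} for s in P}  # actor: action preferences

    for t in range(10000):
        s = random.choice(list(P))              # states visited in arbitrary order
        a = random.choice(list(P[s]))           # exploratory action choice
        s2, r = P[s][a]
        delta = r + gamma * V[s2] - V[s]        # temporal-difference error
        if random.random() < 0.5:               # asynchrony: either component
            V[s] += alpha * delta               # ...critic update for state s
        else:
            pref[s][a] += beta * delta          # ...actor update for pair (s, a)

    # Greedy policy implied by the learned preferences
    print({s: max(pref[s], key=pref[s].get) for s in P})

The deliberately arbitrary interleaving of critic and actor updates across individual states mirrors, in spirit, the complete asynchrony whose consequences the paper examines.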