Actor-Critic Algorithms

Many complex decision-making problems with clear and precise objectives, such as scheduling in manufacturing systems, portfolio management in finance, and admission control in communication networks, can be formulated as stochastic dynamic programming problems in which the objective of decision making is to maximize a single “overall” reward. In these formulations, finding an optimal decision policy involves computing a certain “value function,” which assigns to each state the optimal reward one would obtain if the system were started from that state. This function then naturally prescribes the optimal policy, which is to take decisions that drive the system to states of maximum value. For many practical problems, computing the exact value function is intractable, both analytically and numerically, because of the enormous size of the state space. One therefore has to resort to one of the following approximation methods to find a good sub-optimal policy: (1) approximate the value function; (2) restrict the search for a good policy to a smaller family of policies.

In this thesis, we propose and study actor-critic algorithms, which combine these two approaches with simulation to find the best policy within a parameterized class of policies. Actor-critic algorithms have two learning units: an actor and a critic. The actor is a decision maker with a tunable parameter; the critic is a function approximator. The critic tries to approximate the value function of the policy used by the actor, and the actor in turn tries to improve its policy based on the current approximation provided by the critic. Furthermore, the critic evolves on a faster time-scale than the actor. We propose several variants of actor-critic algorithms. In all the variants, the critic uses Temporal Difference (TD) learning with linear function approximation. Some of the variants are inspired by a new geometric interpretation of the formula for the gradient of the overall reward with respect to the actor parameters. This interpretation suggests a natural set of basis functions for the critic, determined by the family of policies parameterized by the actor's parameters. We concentrate on the average expected reward criterion, but we also show how the algorithms can be modified for other objective criteria. We prove convergence of the algorithms for problems with general (finite, countable, or continuous) state and decision spaces.

To study the rate of convergence (ROC) of our algorithms, we develop a general theory of the ROC of two-time-scale algorithms and apply it to the algorithms proposed here. In the process, we study the ROC of TD learning and compare it with related methods such as Least Squares TD (LSTD). We study the effect of the basis functions used for linear function approximation on the ROC of TD. We also show that the ROC of actor-critic algorithms does not depend on the actual basis functions used in the critic, but only on the subspace they span, and we study this dependence. Finally, we compare the performance of our algorithms with other algorithms that optimize over a parameterized family of policies. We show that when only the “natural” basis functions are used for the critic, the rate of convergence of the actor-critic algorithms is the same as that of certain stochastic gradient descent algorithms. However, with appropriate additional basis functions for the critic, our algorithms outperform the existing ones in terms of ROC.
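
To make the gradient formula mentioned above concrete, the following is the standard average-reward policy-gradient expression from which a “natural” critic basis arises. The notation here (\bar\alpha for the average reward, \eta_\theta for the stationary state distribution, Q_\theta for the differential action-value function) is the conventional one from the policy-gradient literature, used for illustration rather than taken verbatim from the thesis:

    \nabla_\theta \bar\alpha(\theta)
      = \sum_{x,u} \eta_\theta(x)\, \pi_\theta(u \mid x)\, Q_\theta(x,u)\, \psi_\theta(x,u),
    \qquad
    \psi_\theta(x,u) = \nabla_\theta \ln \pi_\theta(u \mid x).

Because the gradient depends on Q_\theta only through its inner products with the components of \psi_\theta, the critic only needs to approximate the projection of Q_\theta onto the subspace spanned by those components; this is why they serve as a natural set of basis functions for the critic.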
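
The two-time-scale structure described in the abstract can be illustrated with a short, self-contained sketch. Everything in it is an assumption made for illustration, not the thesis's algorithm: a finite MDP specified by transition matrices P[u] and a reward table R, a Gibbs (softmax) policy, the “natural” features psi_theta(x, u) = grad_theta log pi_theta(u | x) as the critic's basis, and the particular step-size schedules. The helper names (actor_critic, policy_probs, log_policy_grad) are likewise hypothetical.

import numpy as np

# Minimal two-time-scale actor-critic sketch in the average-reward setting.
# Illustrative assumptions, not the thesis's algorithm: a finite MDP given by
# transition matrices P[u] (one n_states x n_states matrix per action) and a
# reward table R[x, u]; a Gibbs (softmax) policy with one actor parameter per
# state-action pair; a critic linear in the "natural" basis functions
# psi_theta(x, u) = grad_theta log pi_theta(u | x).

def policy_probs(theta, x):
    """Gibbs (softmax) action probabilities in state x."""
    prefs = theta[x] - theta[x].max()          # subtract max for numerical stability
    p = np.exp(prefs)
    return p / p.sum()

def log_policy_grad(theta, x, u, probs):
    """psi_theta(x, u) = grad_theta log pi_theta(u | x) for the Gibbs policy."""
    psi = np.zeros_like(theta)
    psi[x] = -probs
    psi[x, u] += 1.0
    return psi

def actor_critic(P, R, n_steps=100_000, seed=0):
    rng = np.random.default_rng(seed)
    n_actions, n_states = len(P), P[0].shape[0]

    theta = np.zeros((n_states, n_actions))    # actor parameters
    w = np.zeros((n_states, n_actions))        # critic weights, same shape as psi
    rho = 0.0                                  # running average-reward estimate

    x = 0
    probs = policy_probs(theta, x)
    u = rng.choice(n_actions, p=probs)

    for t in range(1, n_steps + 1):
        beta = 1.0 / t**0.6                    # critic step size (faster time-scale)
        alpha = 1.0 / t                        # actor step size (slower time-scale)

        psi = log_policy_grad(theta, x, u, probs)
        r = R[x, u]
        y = rng.choice(n_states, p=P[u][x])

        probs_next = policy_probs(theta, y)
        u_next = rng.choice(n_actions, p=probs_next)
        psi_next = log_policy_grad(theta, y, u_next, probs_next)

        # Linear critic Q(x, u) ~ <w, psi(x, u)>; average-reward TD(0) error.
        q, q_next = np.sum(w * psi), np.sum(w * psi_next)
        delta = r - rho + q_next - q

        rho += beta * (r - rho)                # track the average reward
        w += beta * delta * psi                # critic update (fast)
        theta += alpha * q * psi               # actor update (slow), gradient-like step

        x, u, probs = y, u_next, probs_next

    return theta, w, rho

The step sizes are chosen so that alpha_t / beta_t tends to zero, which is what makes the critic the “fast” component: it effectively tracks the value function of the slowly varying actor, as described in the abstract.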
