Actor-Critic Algorithms

1. Average Reward TD Actor-Critic Algorithm Using Function Approximation

This writeup is not an exhaustive description of what was presented in class; I describe only one family of algorithms that uses the ideas presented there. The title above says it all: we are looking at an Actor-Critic algorithm that uses a policy gradient approach under the average reward criterion. The policy is represented directly by a set of parameters. The parameters could be preferences, as we have seen earlier, or thresholds, as discussed in class. A better choice might be to use one linear function approximator per action that computes the preference for that action. The critic, whose output is used in computing the gradient information for the actor, is represented by a separate linear function approximator (a small sketch of this setup appears at the end of this section).

Let $\Theta = \{\theta_1, \dots, \theta_n\}$ be the parameters of the actor, and let $\Phi_\Theta = \{\phi^1_\Theta, \dots, \phi^m_\Theta\}$ be the features of the critic. Note that the features depend on $\Theta$; the dependence is explained below. Consider the space spanned by the $\phi^i_\Theta$. This is the space consisting of all linear combinations of the features $\phi^1_\Theta, \dots, \phi^m_\Theta$.
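To make this parameterization concrete, here is a minimal NumPy sketch. It assumes a discrete action set, a state-feature vector x(s), a softmax policy over the per-action linear preferences, and, for the critic's $\Theta$-dependent features, the compatible features $\nabla_\Theta \log \pi_\Theta(a\mid s)$; that last choice is an assumption made for illustration, since the text above has not yet specified how the features depend on $\Theta$. All names in the sketch (n_features, n_actions, compatible_features, and so on) are illustrative rather than taken from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)

n_features = 4   # dimension of the state-feature vector x(s)  (illustrative)
n_actions = 3    # size of the discrete action set              (illustrative)

# Actor: one linear function approximator per action.
# theta[a] . x(s) is the preference for action a, and the policy is a
# softmax over these preferences.
theta = rng.normal(scale=0.1, size=(n_actions, n_features))

def preferences(x):
    """Action preferences h(s, a) = theta_a . x(s)."""
    return theta @ x

def policy(x):
    """Softmax policy pi_Theta(a | s) over the action preferences."""
    h = preferences(x)
    h = h - h.max()                 # subtract the max for numerical stability
    e = np.exp(h)
    return e / e.sum()

# Critic: a separate linear function approximator w . phi_Theta(s, a).
# Its features depend on Theta; the choice below is the "compatible"
# features grad_Theta log pi_Theta(a | s), an assumption for this sketch.
def compatible_features(x, a):
    """grad_Theta log pi(a | s) for the softmax-of-linear-preferences actor."""
    pi = policy(x)
    grad = -np.outer(pi, x)         # -pi(b | s) * x(s) for every action b
    grad[a] += x                    # plus x(s) for the chosen action a
    return grad                     # same shape as theta

w = np.zeros_like(theta)            # critic weights, one per actor parameter

def critic_value(x, a):
    """Linear critic output: w . phi_Theta(s, a)."""
    return np.sum(w * compatible_features(x, a))

# Example: evaluate the pieces on a random state-feature vector.
x = rng.normal(size=n_features)
a = rng.choice(n_actions, p=policy(x))
print(policy(x), critic_value(x, a))
```

With the compatible-feature choice, the critic's weight array has the same shape as $\Theta$, one weight per actor parameter, which is what lets the critic's output feed directly into the gradient computation for the actor.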