GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces

A new family of gradient temporal-difference learning algorithms has recently been introduced by Sutton, Maei, and others in which function approximation is much more straightforward. In this paper, we introduce the GQ(λ) algorithm, which can be seen as an extension of that work to a more general setting including eligibility traces and off-policy learning of temporally abstract predictions. These extensions bring us closer to the ultimate goal of this work: the development of a universal prediction learning algorithm suitable for learning experientially grounded knowledge of the world. Eligibility traces are essential to this goal because they bridge the temporal gaps in cause and effect when experience is processed at a temporally fine resolution. Temporally abstract predictions are also essential as the means for representing abstract, higher-level knowledge about courses of action, or options. GQ(λ) can be thought of as an extension of Q-learning. We extend existing convergence results for policy evaluation to this setting and carry out a forward-view/backward-view analysis to derive and prove the validity of the new algorithm.

Introduction

One of the main challenges in artificial intelligence (AI) is to connect low-level experience to high-level representations (grounded world knowledge). Low-level experience refers to the rich signals passed back and forth between the agent and the world. Recent theoretical developments in temporal-difference learning, combined with mathematical ideas developed for temporally abstract options, known as intra-option learning, can be used to address this challenge (Sutton, 2009).

Intra-option learning (Sutton, Precup, and Singh, 1998) is seen as a potential method for temporal abstraction in reinforcement learning. Intra-option learning is a type of off-policy learning. Off-policy learning refers to learning about a target policy while following another policy, known as the behavior policy. Off-policy learning arises in Q-learning, where the target policy is the greedy optimal policy while the behavior policy is exploratory. It is also needed for intra-option learning. Intra-option methods look inside options and allow an AI agent to learn about multiple different options simultaneously from a single stream of received data. An option refers to a temporally extended course of action with a termination condition. Options are ubiquitous in our everyday life. For example, to go hiking, we need to consider and evaluate multiple options, such as transportation options to the hiking trail. Each option includes a course of primitive actions and can be executed only in particular states. The main feature of intra-option learning is its ability to predict the consequences of each option's policy without executing it, while data is received from a different policy.

Temporal-difference (TD) methods in reinforcement learning are powerful techniques for prediction problems. In this paper, we consider predictions always in the form of answers to questions. A question is something like "If I follow this trail, will I see a creek?" The answer to such a question is a single scalar (a value function) that tells us about the expected future consequences given the current state. In general, due to the large number of states, it is not feasible to compute the exact value of each state entry. One of the key features of TD methods is their ability to generalize predictions to states that may not have been visited; this is known as function approximation.
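As a minimal illustration of this kind of generalization (the hiking features, their values, and the weights below are purely hypothetical and are not from the paper), a linear predictor produces an answer for any state, including states never visited before, from a shared feature representation:

```python
import numpy as np

def features(state):
    """Hand-crafted binary features of a hypothetical hiking state."""
    return np.array([
        1.0,                             # bias feature, always active
        float(state["on_trail"]),
        float(state["near_water"]),
        float(state["dense_forest"]),
    ])

theta = np.array([0.1, 0.2, 0.5, -0.1])  # weights that a TD method would learn

def predicted_answer(state):
    # The answer is a single scalar: the inner product of the learned weights
    # with the state's feature vector, so it generalizes across similar states.
    return float(theta.dot(features(state)))

# A state never experienced before still receives a prediction:
print(predicted_answer({"on_trail": True, "near_water": True, "dense_forest": False}))
```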
Recently, Sutton et al. (2009b) and Maei et al. (2009) introduced a new family of gradient TD methods in which function approximation is much more straightforward than in conventional methods. Prior to their work, the classical TD algorithms (e.g., TD(λ) and Q-learning) with function approximation could become unstable and diverge (Baird, 1995; Tsitsiklis and Van Roy, 1997). In this paper, we extend their work to a more general setting that includes off-policy learning of temporally abstract predictions and eligibility traces. Temporally abstract predictions are essential for representing higher-level knowledge about courses of action, or options (Sutton et al., 1998). Eligibility traces bridge temporal gaps when experience is processed at a temporally fine resolution. We introduce the GQ(λ) algorithm, which can be thought of as an extension of Q-learning (Watkins and Dayan, 1989), one of the most popular off-policy learning algorithms in reinforcement learning. Our algorithm incorporates the gradient-descent ideas originally developed by Sutton et al. (2009a,b) for option-conditional predictions with varying eligibility traces. We extend existing convergence results for policy evaluation to this setting, carry out a forward-view/backward-view analysis, and prove the validity of the new algorithm.

The organization of the paper is as follows: First, we describe the problem setting and define our notation. Then we introduce the GQ(λ) algorithm and describe how to use it. In the following sections we provide the derivation of the algorithm and an analysis of the equivalence of the TD forward view and backward view. We finish the paper with a convergence proof and a conclusion section.

Notation and background

We consider the problem of policy evaluation in a finite state-action Markov decision process (MDP). Under standard conditions, however, our results can be extended to MDPs with infinitely many state-action pairs. We use a standard reinforcement learning (RL) framework. In this setting, data is obtained from a continually evolving MDP with states s_t ∈ S, actions a_t ∈ A, and rewards r_t ∈ ℝ, for t = 1, 2, . . ., with each state and reward a function of the preceding state and action. Actions are chosen according to the behavior policy b, which is assumed fixed and exciting, b(s, a) > 0, ∀s, a. We consider the transition probabilities between state-action pairs, and for simplicity we assume there is a finite number N of state-action pairs.

Suppose the agent finds itself at time t in state-action pair s_t, a_t. The agent would like its answer at that time to tell it something about the future sequence s_{t+1}, a_{t+1}, . . . , s_{t+k} if actions from t + 1 on were taken according to the option until it terminated at time t + k. The option's policy is denoted π : S × A → [0, 1] and its termination condition is denoted β : S → [0, 1]. The answer is always a single number, and of course we have to be more specific about what we are trying to predict. There are two common cases: 1) we are trying to predict the outcome of the option; that is, we want to know the expected value of some function of the state at the time the option terminates. We call this function the outcome target function and denote it z : S → ℝ. 2) We are trying to predict the transient; that is, what happens during the option rather than at its end. The most common thing to predict about the transient is the total or discounted reward during the option. We denote this reward function r : S × A → ℝ.

Finally, the answer could conceivably be a mixture of both a transient and an outcome. Here we present the algorithm for answering questions with both an outcome part z and a transient part r, with the two added together. In the common case where one wants only one of the two, the other is set to zero.

Now we can state the goal of learning more precisely. In particular, we would like our answer to be equal to the expected value of the outcome target function at termination plus the cumulative sum of the transient reward function along the way:

Q^π(s_t, a_t) ≡ E[ r_{t+1} + γ r_{t+2} + · · · + γ^{k-1} r_{t+k} + z_{t+k} | π, β ],     (1)

where γ ∈ (0, 1] is the discount factor and Q^π(s, a) denotes the action-value function that evaluates policy π given the state-action pair s, a. To simplify the notation, from now on we drop the superscript π on action values.
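As a rough illustration of how the four question functions and the target (1) fit together, here is a minimal Monte Carlo sketch; the OptionQuestion container, the env_step simulator interface, the integer state and action encodings, and num_actions are our own illustrative assumptions, not part of the paper's formulation:

```python
import random
from dataclasses import dataclass
from typing import Callable

State, Action = int, int   # illustrative: integer-indexed states and actions

@dataclass
class OptionQuestion:
    """Bundle of the four functions that specify a predictive question."""
    pi: Callable[[State, Action], float]   # option policy, pi(s, a) in [0, 1]
    beta: Callable[[State], float]         # termination condition, beta(s) in [0, 1]
    z: Callable[[State], float]            # outcome target function, read at termination
    r: Callable[[State, Action], float]    # transient reward function, summed along the way

def monte_carlo_target(env_step, s, a, q, num_actions, gamma=1.0, num_rollouts=1000):
    """Estimate the target Q(s, a) of equation (1) by executing the option from (s, a).

    env_step(s, a) is a hypothetical simulator returning the next state of the MDP."""
    total = 0.0
    for _ in range(num_rollouts):
        state, action, discount, ret = s, a, 1.0, 0.0
        while True:
            next_state = env_step(state, action)
            ret += discount * q.r(state, action)        # transient part: gamma^(i-1) * r_{t+i}
            if random.random() < q.beta(next_state):    # option terminates in s_{t+k}
                ret += q.z(next_state)                  # outcome part: z_{t+k}
                break
            discount *= gamma                           # gamma may be taken to be 1 (see text)
            weights = [q.pi(next_state, b) for b in range(num_actions)]
            action = random.choices(range(num_actions), weights=weights)[0]
            state = next_state
        total += ret
    return total / num_rollouts
```

Such a direct rollout requires executing the option's policy; the point of the intra-option, off-policy approach developed below is to learn the same quantity from a single stream of behavior data without executing π.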
In many problems the number of state-action pairs is large, and therefore it is not feasible to compute the action value for each state-action entry. We therefore need to approximate the action values through generalization techniques. Here, we use linear function approximation; that is, the answer to a question is always formed linearly as Q_θ(s, a) = θ^⊤ φ(s, a) ≈ Q(s, a) for all s ∈ S and a ∈ A, where θ ∈ ℝ^n is a learned weight vector and φ(s, a) ∈ ℝ^n is a state-action feature vector. The goal is to learn the parameter vector θ through a learning method such as TD learning.

The above (1) describes the target in a Monte Carlo sense, but of course we want to include the possibility of temporal-difference learning, one of the most widely used techniques in reinforcement learning. To do this, we provide an eligibility-trace function λ : S → [0, 1] as described in Sutton and Barto (1998). We allow the eligibility-trace function λ to vary over states.

In the next section we first introduce GQ(λ), a general temporal-difference learning algorithm that is stable under off-policy training, and show how to use it. In later sections we provide the derivation of the algorithm and a convergence proof.

The GQ(λ) algorithm

In this section we introduce the GQ(λ) algorithm for off-policy learning about the outcomes and transients of options; in other words, intra-option GQ(λ) for learning the answer to a question chosen from a wide (possibly universal) class of option-conditional predictive questions. To specify the question, one provides four functions: π and β for the option, and z and r for the target functions. To specify how the answers will be formed, one provides their functional form (here linear), the feature vectors φ(s, a) for all state-action pairs, and the eligibility-trace function λ. The discount factor γ can be taken to be 1, and thus ignored; the same effect as discounting can be achieved through the choice of β. Now we specify the GQ(λ) algorithm as follows: The weight vector θ ∈ ℝ^n is initialized arbitrarily. The secondary weight vector w ∈ ℝ^n is init