Off-policy prediction—learning the value function for one policy from data generated while following another policy—is one of the most challenging subproblems in reinforcement learning. This paper presents empirical results with eleven prominent off-policy learning algorithms that use linear function approximation: five Gradient-TD methods, two Emphatic-TD methods, Off-policy TD(λ), Vtrace, and versions of Tree Backup and ABQ modified to apply to a prediction setting. Our experiments used the Collision task, a small idealized off-policy problem analogous to that of an autonomous car trying to predict whether it will collide with an obstacle. We assessed the performance of the algorithms according to their learning rate, asymptotic error level, and sensitivity to step-size and bootstrapping parameters. By these measures, the eleven algorithms can be partially ordered on the Collision task. In the top tier, the two Emphatic-TD algorithms learned the fastest, reached the lowest errors, and were robust to parameter settings. In the middle tier, the five Gradient-TD algorithms and Off-policy TD(λ) were more sensitive to the bootstrapping parameter. The bottom tier comprised Vtrace, Tree Backup, and ABQ; these algorithms were no faster and had higher asymptotic error than the others. Our results are definitive for this task, though of course experiments with more tasks are needed before an overall assessment of the algorithms’ merits can be made.

1 The Problem of Off-policy Learning

In reinforcement learning, it is not uncommon to learn the value function for one policy while following another. For example, the Q-learning algorithm (Watkins, 1989; Watkins & Dayan, 1992) learns the value of the greedy policy while the agent may select its actions according to a different, more exploratory, policy. The first policy, the one whose value function is being learned, is called the target policy, while the more exploratory policy generating the data is called the behavior policy. When these two policies are different, as they are in Q-learning, the problem is said to be one of off-policy learning; when they are the same, the problem is said to be one of on-policy learning. The former is ‘off’ in the sense that the data comes from a source other than the target policy, whereas in the latter the data is ‘on’ the policy. Off-policy learning is more difficult than on-policy learning and subsumes it as a special case.

One reason for interest in off-policy learning is that it provides a clear way of intermixing exploration and exploitation. The classic dilemma is that an agent should always exploit what it has learned so far—it should take the best actions according to what it has learned—but it should also always explore to find actions that might be superior. No agent can behave in both ways simultaneously. However, an off-policy algorithm like Q-learning can, in a sense, pursue both goals at the same time: the behavior policy can explore freely while the target policy converges to the fully exploitative, optimal policy independent of the behavior policy’s explorations.

Another appealing aspect of off-policy learning is that it enables learning about many policies in parallel. Once the target policy is decoupled from the behavior policy, there is no reason to have only a single target policy.
With off-policy learning, an agent could simultaneously learn how to perform many different tasks optimally (as suggested by Jaderberg et al. (2016) and Rafiee et al. (2019)). Parallel off-policy learning of value functions has even been proposed as a way of learning general, policy-dependent, world knowledge (e.g., Sutton et al., 2011; White, 2015; Ring, in prep). Finally, note that numerous ideas in the machine learning literature rely on effective off-policy learning, including the learning of temporally abstract world models (Sutton, Precup, & Singh, 1999), predictive representations of state (Littman, Sutton, & Singh, 2002; Tanner & Sutton, 2005), auxiliary tasks (Jaderberg et al., 2016), life-long learning (White, 2015), and learning from historical data (Thomas, 2015).

Many off-policy learning algorithms have been explored in the history of reinforcement learning. Q-learning (Watkins, 1989; Watkins & Dayan, 1992) is perhaps the oldest. In the 1990s it was realized that combining off-policy learning, function approximation, and temporal-difference (TD) learning risked instability (Baird, 1995). Precup, Sutton, and Singh (2000) introduced off-policy algorithms with importance sampling and eligibility traces, as well as tree-backup algorithms, but did not provide a practical solution to the risk of instability. Gradient-TD methods (see Maei, 2011; Sutton et al., 2009) assured stability by following the gradient of an objective function, as suggested by Baird (1999). Emphatic-TD methods (Sutton, Mahmood, & White, 2016) reweighted updates in such a way as to regain the convergence assurances of the original on-policy TD algorithms. These methods had convergence guarantees, but no assurances that they would be efficient in practice. Other off-policy algorithms, including Retrace (Munos et al., 2016), Vtrace (Espeholt et al., 2018), and ABQ (Mahmood, Yu, & Sutton, 2017), were developed more recently to overcome difficulties encountered in practice.

As more off-policy learning methods were developed, there was a need to compare them systematically. The earliest systematic study was that by Geist and Scherrer (2014), whose experiments compared eight major off-policy algorithms on random MDPs. A few months later, Dann, Neumann, and Peters (2014) published a more in-depth study with one additional algorithm (an early Gradient-TD algorithm) and six test problems, including random MDPs. Both studies considered off-policy problems in which the target and behavior policies were given and stationary. Such prediction problems allow for relatively simple experiments yet remain challenging (e.g., they involve the same risk of instability). Both studies also used linear function approximation with a given feature representation. The algorithms studied by Geist and Scherrer (2014) and by Dann, Neumann, and Peters (2014) can be divided into those whose per-step complexity is linear in the number of parameters, like TD(λ), and those whose per-step complexity is quadratic in the number of parameters (i.e., proportional to its square), like Least-Squares TD(λ) (Bradtke & Barto, 1996; Boyan, 1999). Quadratic-complexity methods avoid the risk of instability, but cannot be used in learning systems with large numbers (e.g., millions) of weights. A third systematic study, by White and White (2016), excluded quadratic-complexity algorithms but added four additional linear-complexity algorithms.
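To make the shared setting concrete, the sketch below shows the kind of linear-complexity update that these algorithms build on: Off-policy TD(λ) for prediction with linear function approximation and importance sampling. It is an illustrative sketch rather than code from this paper or any particular implementation; the function name, the argument names, and the exact placement of the importance-sampling correction are our own choices and differ across algorithms.

```python
import numpy as np

def off_policy_td_lambda_step(w, z, x, x_next, r, rho, alpha, gamma, lam):
    """One O(d) step of Off-policy TD(lambda) with linear values v(s) = dot(w, x(s)).

    w      : weight vector, shape (d,)
    z      : eligibility trace, shape (d,)
    x      : feature vector of the current state, shape (d,)
    x_next : feature vector of the next state, shape (d,)
    r      : reward on the transition
    rho    : importance-sampling ratio pi(a|s) / b(a|s) for the action taken
    alpha  : step size
    gamma  : discount factor
    lam    : bootstrapping parameter lambda
    """
    delta = r + gamma * np.dot(w, x_next) - np.dot(w, x)  # TD error
    z = rho * (gamma * lam * z + x)                       # trace, corrected toward the target policy
    w = w + alpha * delta * z                             # O(d) per step, unlike LSTD's O(d^2)
    return w, z
```

The eleven algorithms compared here each modify this basic recipe, for example by adding a second set of learned weights (Gradient-TD), reweighting updates with an emphasis variable (Emphatic-TD), or altering how the trace is corrected (Tree Backup, Retrace/Vtrace, ABQ), while keeping the per-step cost linear in the number of parameters.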
The current paper is similar to these previous studies in that it treats prediction with linear function approximation, and similar to the study by White and White (2016) in restricting attention to linear-complexity algorithms. Our study differs from the earlier studies in that it treats more algorithms and provides a deeper empirical analysis on a single problem, the Collision task. The additional algorithms are the prediction variants of Tree Backup(λ) (Precup, Sutton, & Singh, 2000), Retrace(λ) (Munos et al., 2016), ABQ(ζ) (Mahmood, Yu, & Sutton, 2017), and TDRC(λ) (Ghiassian et al., 2020). Our empirical analysis is deeper primarily in that we examine and report the dependence of all eleven algorithms’ performance on all of their parameters. This level of detail is needed to expose our main result, an overall ordering of the performance of off-policy algorithms on the Collision task. Our results, though limited to this task, are a significant addition to what is known about the comparative performance of off-policy learning algorithms.
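As a rough illustration of the sensitivity analysis just described, the sketch below sweeps a learner over a grid of step sizes and bootstrapping parameters and records the mean and standard error of its final error over independent runs. The run_learner callable is a hypothetical stand-in for any one of the eleven algorithms applied to a task such as the Collision task; none of the names or values here come from the paper.

```python
import numpy as np

def sweep(run_learner, alphas, lambdas, n_runs=30):
    """Return mean and standard error of the final error for each (alpha, lambda) pair.

    run_learner(alpha, lam, seed) is assumed to run one independent trial of an
    off-policy learner and return its final error (e.g., RMSVE against the true
    value function of the target policy).
    """
    results = {}
    for alpha in alphas:
        for lam in lambdas:
            errs = np.array([run_learner(alpha, lam, seed) for seed in range(n_runs)])
            results[(alpha, lam)] = (errs.mean(), errs.std(ddof=1) / np.sqrt(n_runs))
    return results

# Example usage with a placeholder learner standing in for a real algorithm:
def dummy_learner(alpha, lam, seed):
    rng = np.random.default_rng(seed)
    return abs(rng.normal(0.5 * alpha + 0.1 * lam, 0.05))  # fabricated error, for illustration only

table = sweep(dummy_learner,
              alphas=[2.0 ** -k for k in range(1, 7)],
              lambdas=[0.0, 0.5, 0.9, 1.0])
best = min(table, key=lambda p: table[p][0])  # parameter pair with lowest mean error
```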
References

[1] A. Juditsky et al. (2008). Solving variational inequalities with Stochastic Mirror-Prox algorithm. arXiv:0809.0815.
[2] Tom Schaul et al. (2016). Reinforcement Learning with Unsupervised Auxiliary Tasks. ICLR.
[3] Marek Petrik et al. (2015). Finite-Sample Analysis of Proximal Gradient TD Algorithms. UAI.
[4] Leemon C. Baird et al. (1995). Residual Algorithms: Reinforcement Learning with Function Approximation. ICML.
[5] Bo Liu et al. (2014). Proximal Reinforcement Learning: A New Theory of Sequential Decision Making in Primal-Dual Spaces. arXiv.
[6] Martha White et al. (2016). Investigating Practical Linear Temporal Difference Learning. AAMAS.
[7] R. Sutton et al. (2011). Gradient temporal-difference learning algorithms.
[8] Jan Peters et al. (2015). Policy evaluation with temporal differences: a survey and comparison. Journal of Machine Learning Research.
[9] Ben J. A. Kröse et al. (1995). Learning from delayed rewards. Robotics and Autonomous Systems.
[10] Marc G. Bellemare et al. (2016). Safe and Efficient Off-Policy Reinforcement Learning. NIPS.
[11] Richard S. Sutton et al. (2014). Prediction Driven Behavior: Learning Predictions that Drive Fixed Responses. AAAI.
[12] Adam M. White et al. (2015). Developing a Predictive Approach to Knowledge.
[13] Richard S. Sutton et al. (2017). On Generalized Bellman Equations and Temporal-Difference Learning. Canadian Conference on AI.
[14] Richard S. Sutton et al. (2019). Prediction in Intelligence: An Empirical Comparison of Off-policy Algorithms on Robots. AAMAS.
[15] Doina Precup et al. (2000). Eligibility Traces for Off-Policy Policy Evaluation. ICML.
[16] Matthieu Geist et al. (2013). Off-policy learning with eligibility traces: a survey. Journal of Machine Learning Research.
[17] Shane Legg et al. (2018). IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures. ICML.
[18] Leah M. Hackman et al. (2013). Faster Gradient-TD Algorithms.
[19] L. Baird (1999). Reinforcement Learning Through Gradient Descent.
[20] Martha White et al. (2016). Unifying Task Specification in Reinforcement Learning. ICML.
[21] Justin A. Boyan et al. (1999). Least-Squares Temporal Difference Learning. ICML.
[22] Pascal Vincent et al. (2017). Convergent Tree-Backup and Retrace with Function Approximation. ICML.
[23] Richard S. Sutton et al. (2017). A First Empirical Study of Emphatic Temporal Difference Learning. arXiv.
[24] Richard S. Sutton et al. (2001). Predictive Representations of State. NIPS.
[25] Richard S. Sutton et al. (2005). TD(λ) networks: temporal-difference networks with eligibility traces. ICML.
[26] R. Sutton et al. (2008). A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation. NIPS.
[27] Steven J. Bradtke et al. (2004). Linear Least-Squares algorithms for temporal difference learning. Machine Learning.
[28] Shalabh Bhatnagar et al. (2009). Fast gradient-descent methods for temporal-difference learning with linear function approximation. ICML.
[29] Peter Dayan et al. (1992). Q-learning. Machine Learning.
[30] Doina Precup et al. (1999). Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning. Artificial Intelligence.
[31] Philip S. Thomas et al. (2015). Safe Reinforcement Learning.
[32] Shie Mannor et al. (2015). Generalized Emphatic Temporal Difference Learning: Bias-Variance Analysis. AAAI.
[33] Martha White et al. (2015). An Emphatic Approach to the Problem of Off-policy Temporal-Difference Learning. Journal of Machine Learning Research.
[34] Adam White et al. (2020). Gradient Temporal-Difference Learning with Regularized Corrections. ICML.
[35] Patrick M. Pilarski et al. (2011). Horde: a scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. AAMAS.
[36] Sanjoy Dasgupta et al. (2001). Off-Policy Temporal Difference Learning with Function Approximation. ICML.