Online Off-policy Prediction

This paper investigates online prediction learning, where learning proceeds continuously as the agent interacts with its environment. Each prediction is contingent on a particular way of behaving and is represented as a value function. However, the policy used to select actions and generate the training data (the behavior policy) may differ from the policy that defines the predictions (the target policy), and so the samples are generated off-policy. The ability to learn behavior-contingent predictions online and off-policy has long been advocated as a key capability of predictive-knowledge learning systems, but it remained an open algorithmic challenge for decades. The issue lies with the temporal-difference (TD) update at the heart of most prediction algorithms: combining bootstrapping, off-policy sampling, and function approximation can cause the value estimates to diverge. A breakthrough came with the development of a new objective function that admitted stochastic gradient descent variants of TD. Since then, many sound online off-policy prediction algorithms have been developed, but there has been limited empirical work investigating the relative merits of all the variants. This paper aims to fill these empirical gaps and provide clarity on the key ideas behind each method. We summarize the large body of literature on off-policy learning, focusing on (1) methods whose computation is linear in the number of features and that are convergent under off-policy sampling, and (2) other methods that have proven useful with non-fixed, nonlinear function approximation. We provide an empirical study of off-policy prediction methods in two challenging microworlds, reporting each method's parameter sensitivity, empirical convergence rate, and final performance. These results provide new insights that should enable practitioners to extend these methods to large-scale applications.
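To make the contrast between conventional off-policy TD and a gradient-TD variant concrete, the sketch below shows per-step updates for linear off-policy TD(0) with importance sampling and for TDC (TD with gradient correction), following the standard linear-function-approximation formulation from the gradient-TD literature. The function names, step-size choices, and the tiny usage example are illustrative assumptions, not part of this paper's experimental setup.

```python
import numpy as np

def off_policy_td0_update(w, x, r, x_next, gamma, rho, alpha):
    """One step of linear off-policy TD(0) with importance sampling.

    w        : weight vector (value estimate is v(s) ~= w . x)
    x, x_next: feature vectors for the current and next state
    r        : reward on the transition
    gamma    : discount factor
    rho      : importance sampling ratio pi(a|s) / b(a|s)
    alpha    : step size
    Note: this update is not guaranteed to converge when bootstrapping,
    off-policy sampling, and function approximation are combined.
    """
    delta = r + gamma * w.dot(x_next) - w.dot(x)   # TD error
    return w + alpha * rho * delta * x


def tdc_update(w, h, x, r, x_next, gamma, rho, alpha, beta):
    """One step of linear TDC (a gradient-TD method).

    h estimates the expected TD error given the features; the correction
    term -gamma * x_next * (h . x) counteracts the bias that can make
    plain off-policy TD diverge. Returns the updated (w, h) pair.
    """
    delta = r + gamma * w.dot(x_next) - w.dot(x)   # TD error
    w = w + alpha * rho * (delta * x - gamma * x_next * h.dot(x))
    h = h + beta * rho * (delta - h.dot(x)) * x
    return w, h


if __name__ == "__main__":
    # Tiny illustrative transition with 3 binary features (hypothetical data).
    w, h = np.zeros(3), np.zeros(3)
    x, x_next = np.array([1.0, 0.0, 1.0]), np.array([0.0, 1.0, 1.0])
    r, gamma, rho = 1.0, 0.99, 1.5   # rho > 1: target policy favors this action
    w_td = off_policy_td0_update(w.copy(), x, r, x_next, gamma, rho, alpha=0.1)
    w, h = tdc_update(w, h, x, r, x_next, gamma, rho, alpha=0.1, beta=0.05)
    print("TD(0) weights:", w_td, "  TDC weights:", w)
```

Both updates cost O(d) per step in the number of features d; the gradient-TD variant pays for its convergence guarantee with the second weight vector and second step size.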
