True Online Temporal-Difference Learning

The temporal-difference methods TD($\lambda$) and Sarsa($\lambda$) form a core part of modern reinforcement learning. Their appeal comes from their good performance, low computational cost, and their simple interpretation in terms of their forward view. Recently, new versions of these methods were introduced, called true online TD($\lambda$) and true online Sarsa($\lambda$), respectively (van Seijen & Sutton, 2014). These new versions maintain an exact equivalence with the forward view at all times, whereas the traditional versions only approximate it for small step-sizes. We hypothesize that these true online methods not only have better theoretical properties, but also dominate the regular methods empirically. In this article, we put this hypothesis to the test by performing an extensive empirical comparison. Specifically, we compare the performance of true online TD($\lambda$)/Sarsa($\lambda$) with that of regular TD($\lambda$)/Sarsa($\lambda$) on random MRPs, a real-world myoelectric prosthetic arm, and a domain from the Arcade Learning Environment. We use linear function approximation with tabular, binary, and non-binary features. Our results suggest that the true online methods indeed dominate the regular methods: across all domains and representations, the learning speed of the true online methods is often better than, and never worse than, that of the regular methods. An additional advantage is that the true online methods require no choice between different trace types. Besides the empirical results, we provide an in-depth analysis of the theory behind true online temporal-difference learning. In addition, we show that new true online temporal-difference methods can be derived by making changes to the online forward view and then rewriting the update equations.
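
To make the central update concrete, the sketch below shows true online TD($\lambda$) for policy evaluation with linear function approximation, following the update rules of van Seijen & Sutton (2014). The `transitions` iterable, the variable names, and the default parameter values are illustrative assumptions, not part of the original papers.

```python
import numpy as np

def true_online_td_lambda(transitions, n_features, alpha=0.01, gamma=0.99, lam=0.9):
    """Sketch of true online TD(lambda) with linear function approximation.

    `transitions` is assumed to yield (phi, reward, phi_next, done) tuples,
    where phi and phi_next are the feature vectors of the current and next state.
    """
    w = np.zeros(n_features)   # weight vector
    e = np.zeros(n_features)   # dutch-style eligibility trace
    v_old = 0.0                # value of the previous state under the previous weights

    for phi, reward, phi_next, done in transitions:
        v = w @ phi
        v_next = 0.0 if done else w @ phi_next
        delta = reward + gamma * v_next - v

        # Dutch trace: note the extra correction term compared to the
        # accumulating trace (e = gamma*lam*e + phi) of regular TD(lambda).
        e = gamma * lam * e + phi - alpha * gamma * lam * (e @ phi) * phi

        # Weight update with the additional (v - v_old) correction terms that
        # give the exact equivalence with the online forward view.
        w += alpha * (delta + v - v_old) * e - alpha * (v - v_old) * phi

        v_old = v_next
        if done:               # episode boundary: reset trace and v_old
            e = np.zeros(n_features)
            v_old = 0.0

    return w
```

True online Sarsa($\lambda$), used for control, applies the same form of update to state-action feature vectors, with the TD error computed from the value of the next state-action pair.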

[1] Ben J. A. Kröse, et al. Learning from delayed rewards, 1995, Robotics Auton. Syst.

[2] Manfred K. Warmuth, et al. On the worst-case analysis of temporal-difference learning algorithms, 2004, Machine Learning.

[3] Andrew W. Moore, et al. Reinforcement Learning: A Survey, 1996, J. Artif. Intell. Res.

[4] John N. Tsitsiklis, et al. Neuro-Dynamic Programming, 1996, Encyclopedia of Machine Learning.

[5] Richard S. Sutton, et al. Introduction to Reinforcement Learning, 1998.

[6] Peter Dayan, et al. The convergence of TD(λ) for general λ, 1992, Machine Learning.

[7] Richard S. Sutton, et al. Reinforcement Learning: An Introduction, 1998, IEEE Trans. Neural Networks.

[8] Richard S. Sutton, et al. Learning to predict by the methods of temporal differences, 1988, Machine Learning.

[9] B. Hudgins, et al. Myoelectric signal processing for control of powered limb prostheses, 2006, Journal of Electromyography and Kinesiology.

[10] R. Sutton, et al. A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation, 2008, NIPS.

[11] Shalabh Bhatnagar, et al. Fast gradient-descent methods for temporal-difference learning with linear function approximation, 2009, ICML.

[12] Csaba Szepesvári, et al. Algorithms for Reinforcement Learning, 2010, Synthesis Lectures on Artificial Intelligence and Machine Learning.

[13] Patrick M. Pilarski, et al. Horde: a scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction, 2011, AAMAS.

[14] P. Thomas, et al. TDγ: Re-evaluating Complex Backups in Temporal Difference Learning, 2011.

[15] Scott Niekum, et al. TDγ: Re-evaluating Complex Backups in Temporal Difference Learning, 2011, NIPS.

[16] R. Sutton, et al. Gradient temporal-difference learning algorithms, 2011.

[17] Patrick M. Pilarski, et al. Adaptive artificial limbs: a real-time approach to prediction and anticipation, 2013, IEEE Robotics & Automation Magazine.

[18] Thore Graepel, et al. A Comparison of learning algorithms on the Arcade Learning Environment, 2014, ArXiv.

[19] Jacqueline S. Hebert, et al. Novel Targeted Sensory Reinnervation Technique to Restore Functional Hand Sensation After Transhumeral Amputation, 2014, IEEE Transactions on Neural Systems and Rehabilitation Engineering.

[20] Richard S. Sutton, et al. True online TD(λ), 2014, ICML.

[21] R. Sutton, et al. A new Q(λ) with interim forward view and Monte Carlo equivalence, 2014.

[22] Doina Precup, et al. A new Q(λ) with interim forward view and Monte Carlo equivalence, 2014, ICML.

[23] Richard S. Sutton, et al. Off-policy TD(λ) with a true online equivalence, 2014, UAI.

[24] Richard S. Sutton, et al. Multi-timescale nexting in a reinforcement learning robot, 2011, Adapt. Behav.

[25] Scott Niekum, et al. Policy Evaluation Using the Ω-Return, 2015, NIPS.

[26] Richard S. Sutton, et al. Off-policy learning based on weighted importance sampling with linear computational complexity, 2015, UAI.

[27] Richard S. Sutton, et al. Learning to Predict Independent of Span, 2015, ArXiv.

[28] Shane Legg, et al. Human-level control through deep reinforcement learning, 2015, Nature.

[29] Marc G. Bellemare, et al. The Arcade Learning Environment: An Evaluation Platform for General Agents (Extended Abstract), 2012, IJCAI.

[30] Harm van Seijen, et al. Effective Multi-step Temporal-Difference Learning for Non-Linear Function Approximation, 2016, ArXiv.