[1] Richard S. Sutton, et al. Off-policy TD(λ) with a true online equivalence, 2014, UAI.
[2] Martin A. Riedmiller, et al. A direct adaptive method for faster backpropagation learning: the RPROP algorithm, 1993, IEEE International Conference on Neural Networks.
[3] Richard S. Sutton, et al. Learning to predict by the methods of temporal differences, 1988, Machine Learning.
[4] Huizhen Yu, et al. On Convergence of Emphatic Temporal-Difference Learning, 2015, COLT.
[5] Patrick M. Pilarski, et al. An Empirical Evaluation of True Online TD(λ), 2015, ArXiv.
[6] Martha White, et al. An Emphatic Approach to the Problem of Off-policy Temporal-Difference Learning, 2015, J. Mach. Learn. Res.
[7] Richard S. Sutton, et al. GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces, 2010.
[8] Richard S. Sutton, et al. Reinforcement Learning: An Introduction, 1998, IEEE Trans. Neural Networks.
[9] Patrick M. Pilarski, et al. Horde: a scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction, 2011, AAMAS.
[10] Philip Thomas, et al. Bias in Natural Actor-Critic Algorithms, 2014, ICML.
[11] Patrick M. Pilarski, et al. Tuning-free step-size adaptation, 2012, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[12] Richard S. Sutton, et al. True online TD(λ), 2014, ICML.
[13] Adam M. White. Developing a Predictive Approach to Knowledge, 2015.
[14] Shalabh Bhatnagar, et al. Fast gradient-descent methods for temporal-difference learning with linear function approximation, 2009, ICML.
[15] Andrew G. Barto, et al. Adaptive Step-Size for Online Temporal Difference Learning, 2012, AAAI.
[16] Richard S. Sutton, et al. Gradient temporal-difference learning algorithms, 2011.
[17] Richard S. Sutton, et al. Multi-timescale nexting in a reinforcement learning robot, 2011, Adapt. Behav.