论文信息 - An Empirical Evaluation of True Online TD({\lambda})

An Empirical Evaluation of True Online TD({\lambda})

The true online TD({\lambda}) algorithm has recently been proposed (van Seijen and Sutton, 2014) as a universal replacement for the popular TD({\lambda}) algorithm, in temporal-difference learning and reinforcement learning. True online TD({\lambda}) has better theoretical properties than conventional TD({\lambda}), and the expectation is that it also results in faster learning. In this paper, we put this hypothesis to the test. Specifically, we compare the performance of true online TD({\lambda}) with that of TD({\lambda}) on challenging examples, random Markov reward processes, and a real-world myoelectric prosthetic arm. We use linear function approximation with tabular, binary, and non-binary features. We assess the algorithms along three dimensions: computational cost, learning speed, and ease of use. Our results confirm the strength of true online TD({\lambda}): 1) for sparse feature vectors, the computational overhead with respect to TD({\lambda}) is minimal; for non-sparse features the computation time is at most twice that of TD({\lambda}), 2) across all domains/representations the learning speed of true online TD({\lambda}) is often better, but never worse than that of TD({\lambda}), and 3) true online TD({\lambda}) is easier to use, because it does not require choosing between trace types, and it is generally more stable with respect to the step-size. Overall, our results suggest that true online TD({\lambda}) should be the first choice when looking for an efficient, general-purpose TD method.

[1] Richard S. Sutton,et al. True online TD(λ) , 2014, ICML 2014.

[2] Richard S. Sutton,et al. Reinforcement Learning: An Introduction , 1998, IEEE Trans. Neural Networks.

[3] Patrick M. Pilarski,et al. Horde: a scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction , 2011, AAMAS.

[4] B Hudgins,et al. Myoelectric signal processing for control of powered limb prostheses. , 2006, Journal of electromyography and kinesiology : official journal of the International Society of Electrophysiological Kinesiology.

[5] Richard S. Sutton,et al. Multi-timescale nexting in a reinforcement learning robot , 2011, Adapt. Behav..

[6] Proceedings of the 31th International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014 , 2014, ICML.

[7] Andrew W. Moore,et al. Reinforcement Learning: A Survey , 1996, J. Artif. Intell. Res..

[8] Richard S. Sutton,et al. Learning to predict by the methods of temporal differences , 1988, Machine Learning.

[9] Peter Dayan,et al. The convergence of TD(λ) for general λ , 1992, Machine Learning.

[10] Patrick M. Pilarski,et al. Adaptive artificial limbs: a real-time approach to prediction and anticipation , 2013, IEEE Robotics & Automation Magazine.

[11] Csaba Szepesvári,et al. Algorithms for Reinforcement Learning , 2010, Synthesis Lectures on Artificial Intelligence and Machine Learning.