论文信息 - TD_gamma: Re-evaluating Complex Backups in Temporal Difference Learning

TD_gamma: Re-evaluating Complex Backups in Temporal Difference Learning

We show that the λ-return target used in the TD(λ) family of algorithms is the maximum likelihood estimator for a specific model of how the variance of an n-step return estimate increases with n. We introduce the γ-return estimator, an alternative target based on a more accurate model of variance, which defines the TDγ family of complex-backup temporal difference learning algorithms. We derive TDγ, the γ-return equivalent of the original TD(λ) algorithm, which eliminates the λ parameter but can only perform updates at the end of an episode and requires time and space proportional to the episode length. We then derive a second algorithm, TDγ(C), with a capacity parameter C. TDγ(C) requires C times more time and memory than TD(λ) and is incremental and online. We show that TDγ outperforms TD(λ) for any setting of λ on 4 out of 5 benchmark domains, and that TDγ(C) performs as well as or better than TDγ for intermediate settings of C.

[1] Richard S. Sutton,et al. Introduction to Reinforcement Learning , 1998 .

[2] Shalabh Bhatnagar,et al. Fast gradient-descent methods for temporal-difference learning with linear function approximation , 2009, ICML '09.

[3] Richard S. Sutton,et al. Reinforcement Learning: An Introduction , 1998, IEEE Trans. Neural Networks.

[4] Richard S. Sutton,et al. Reinforcement learning with replacing eligibility traces , 2004, Machine Learning.

[5] Richard S. Sutton,et al. Learning to predict by the methods of temporal differences , 1988, Machine Learning.

[6] Richard S. Sutton,et al. GQ(lambda): A general gradient algorithm for temporal-difference prediction learning with eligibility traces , 2010, Artificial General Intelligence.

[7] Pawel Cichosz,et al. Truncating Temporal Differences: On the Efficient Implementation of TD(lambda) for Reinforcement Learning , 1994, J. Artif. Intell. Res..

[8] George Konidaris,et al. Value Function Approximation in Reinforcement Learning Using the Fourier Basis , 2011, AAAI.

[9] Scott Sanner,et al. Temporal Difference Bayesian Model Averaging: A Bayesian Perspective on Adapting Lambda , 2010, ICML.

[10] Justin A. Boyan,et al. Least-Squares Temporal Difference Learning , 1999, ICML.

[11] Kenji Doya,et al. Reinforcement Learning in Continuous Time and Space , 2000, Neural Computation.