Emphatic Temporal-Difference Learning

Emphatic algorithms are temporal-difference learning algorithms that change their effective state distribution by selectively emphasizing and de-emphasizing their updates on different time steps. Recent work by Sutton, Mahmood, and White (2015) and by Yu (2015) shows that by varying the emphasis in a particular way, these algorithms become stable and convergent under off-policy training with linear function approximation. This paper serves as a unified summary of the results available from both works. In addition, we demonstrate the empirical benefits of the flexibility of emphatic algorithms, including state-dependent discounting, state-dependent bootstrapping, and the user-specified allocation of function approximation resources.
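
To make the emphasis mechanism concrete, here is a minimal sketch of one ETD(λ) update for linear off-policy policy evaluation, intended to follow the general form of the recursions in Sutton, Mahmood, and White (2015); the Python function and its argument names (prev_rho, interest, the followon trace F, the emphasis M, and so on) are illustrative choices, not code from the paper.

    import numpy as np

    def etd_lambda_step(theta, e, F, x, x_next, reward, alpha,
                        rho, prev_rho, gamma, gamma_next, lam, interest):
        """One emphatic TD(lambda) update with linear function approximation (sketch).

        rho, prev_rho     -- importance-sampling ratios pi(a|s)/mu(a|s) at steps t and t-1
        gamma, gamma_next -- (possibly state-dependent) discounts at the current and next states
        lam               -- (possibly state-dependent) bootstrapping parameter at the current state
        interest          -- user-specified interest in the current state
        """
        # Followon trace: discounted, importance-corrected accumulation of interest.
        F = prev_rho * gamma * F + interest
        # Emphasis: how strongly this time step's update is weighted.
        M = lam * interest + (1.0 - lam) * F
        # Eligibility trace, scaled by the emphasis and the current importance-sampling ratio.
        e = rho * (gamma * lam * e + M * x)
        # One-step TD error for the linear value estimate theta.dot(x).
        delta = reward + gamma_next * np.dot(theta, x_next) - np.dot(theta, x)
        # Emphatic TD update of the weights.
        theta = theta + alpha * delta * e
        return theta, e, F

In this sketch, state-dependent discounting and bootstrapping enter only through the per-step gamma and lam arguments, and the interest argument is where a user can direct function approximation resources toward the states they care about, which is the flexibility the abstract refers to.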

[1]  R. S. Varga. Matrix Iterative Analysis. Prentice-Hall, 1962.

[2]  R. S. Sutton. TD Models: Modeling the World at a Mixture of Time Scales. ICML, 1995.

[3]  L. C. Baird. Residual Algorithms: Reinforcement Learning with Function Approximation. ICML, 1995.

[4]  D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.

[5]  S. J. Bradtke and A. G. Barto. Linear Least-Squares Algorithms for Temporal Difference Learning. Machine Learning, 1996.

[6]  J. A. Boyan. Least-Squares Temporal Difference Learning. ICML, 1999.

[7]  D. Precup, R. S. Sutton, and S. P. Singh. Eligibility Traces for Off-Policy Policy Evaluation. ICML, 2000.

[8]  H. J. Kushner and G. G. Yin. Stochastic Approximation and Recursive Algorithms and Applications. Springer, 2003.

[9]  J. N. Tsitsiklis and B. Van Roy. Feature-based methods for large scale dynamic programming. Machine Learning, 1996.

[10]  R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

[11]  R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 1988.

[12]  H. R. Maei and R. S. Sutton. GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces. Artificial General Intelligence (AGI), 2010.

[13]  H. R. Maei. Gradient Temporal-Difference Learning Algorithms. PhD thesis, University of Alberta, 2011.

[14]  H. Yu. Least Squares Temporal Difference Methods: An Analysis under General Conditions. SIAM Journal on Control and Optimization, 2012.

[15]  A. R. Mahmood, H. van Hasselt, and R. S. Sutton. Weighted importance sampling for off-policy learning with linear function approximation. NIPS, 2014.

[16]  P. S. Thomas. Bias in Natural Actor-Critic Algorithms. ICML, 2014.

[17]  H. van Seijen and R. S. Sutton. True online TD(λ). ICML, 2014.

[18]  H. van Hasselt, A. R. Mahmood, and R. S. Sutton. Off-policy TD(λ) with a true online equivalence. UAI, 2014.

[19]  H. Yu. On Convergence of Emphatic Temporal-Difference Learning. COLT, 2015.

[20]  A. R. Mahmood and R. S. Sutton. Off-policy learning based on weighted importance sampling with linear computational complexity. UAI, 2015.

[21]  R. S. Sutton, A. R. Mahmood, and M. White. An Emphatic Approach to the Problem of Off-policy Temporal-Difference Learning. Journal of Machine Learning Research, 2015.