Adaptive Lambda Least-Squares Temporal Difference Learning

Temporal difference learning, or TD($\lambda$), is a fundamental algorithm in reinforcement learning. However, setting TD's $\lambda$ parameter, which controls the timescale of TD updates, is generally left to the practitioner. We formalize the $\lambda$ selection problem as a bias-variance trade-off whose solution is the value of $\lambda$ that minimizes the Mean Squared Value Error (MSVE). To resolve this trade-off, we propose applying Leave-One-Trajectory-Out Cross-Validation (LOTO-CV) to search the space of $\lambda$ values. Unfortunately, a naïve implementation of this approach is too computationally expensive for most practical applications. For Least-Squares TD (LSTD), we show that LOTO-CV can be implemented efficiently to automatically tune $\lambda$, and we apply function optimization methods to search the space of $\lambda$ values efficiently. The resulting algorithm, ALLSTD, is parameter free, and our experiments demonstrate that it is significantly faster than the naïve LOTO-CV implementation while achieving similar performance.
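To make the underlying idea concrete, the following is a minimal sketch (not the paper's implementation) of LSTD($\lambda$) fit on a set of trajectories, together with a naïve LOTO-CV loop that selects the $\lambda$ with the smallest held-out error. The function names (`lstd_lambda`, `loto_cv_lambda`), the trajectory format, and the use of held-out Monte Carlo returns as an empirical stand-in for the unknown true values in the MSVE are assumptions made for this example; the efficient rank-one-update machinery that makes ALLSTD fast is not shown.

```python
import numpy as np

def lstd_lambda(trajectories, phi, gamma, lam, reg=1e-6):
    """Fit LSTD(lambda) weights from a list of trajectories.

    Each trajectory is a list of (state, reward, next_state) tuples and
    phi maps a state to a feature vector (assumed zero at terminal states).
    """
    d = len(phi(trajectories[0][0][0]))
    A = np.zeros((d, d))
    b = np.zeros(d)
    for traj in trajectories:
        z = np.zeros(d)  # eligibility trace, reset at the start of each episode
        for (s, r, s_next) in traj:
            f, f_next = phi(s), phi(s_next)
            z = gamma * lam * z + f
            A += np.outer(z, f - gamma * f_next)
            b += z * r
    # small ridge term keeps the system well conditioned
    return np.linalg.solve(A + reg * np.eye(d), b)

def loto_cv_lambda(trajectories, phi, gamma, lambdas):
    """Naive leave-one-trajectory-out CV over candidate lambda values.

    The held-out error is the squared difference between predicted values
    and the Monte Carlo returns of the held-out trajectory, used here as a
    proxy for the MSVE (which would require the true value function).
    """
    errors = []
    for lam in lambdas:
        err = 0.0
        for i in range(len(trajectories)):
            train = trajectories[:i] + trajectories[i + 1:]
            w = lstd_lambda(train, phi, gamma, lam)
            G = 0.0
            # iterate backwards so G is the return from each state onward
            for (s, r, _) in reversed(trajectories[i]):
                G = r + gamma * G
                err += (phi(s) @ w - G) ** 2
        errors.append(err)
    return lambdas[int(np.argmin(errors))]
```

Refitting LSTD from scratch for every held-out trajectory and every candidate $\lambda$, as this sketch does, is exactly the cost that ALLSTD avoids.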
