Regularized Off-Policy TD-Learning

We present a novel l1-regularized off-policy convergent TD-learning method (termed RO-TD), which learns sparse representations of value functions with low computational complexity. The algorithmic framework underlying RO-TD integrates two key ideas: off-policy convergent gradient TD methods, such as TDC, and a convex-concave saddle-point formulation of non-smooth convex optimization, which enables first-order solvers and feature selection via online convex regularization. We provide a detailed theoretical analysis of RO-TD, together with experiments illustrating its off-policy convergence, sparse feature selection, and low computational cost.
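To make the ingredients concrete, the sketch below combines a TDC-style two-timescale update with a soft-thresholding (proximal) step that enforces an l1 penalty on the value-function weights. This is only an illustrative approximation of the ideas the abstract names, not the paper's actual saddle-point solver; the function names, the step sizes alpha and beta, the importance ratio rho, and the regularization weight lam are all hypothetical.

```python
import numpy as np


def soft_threshold(x, tau):
    """Proximal operator of tau * ||x||_1 (coordinate-wise shrinkage)."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)


def l1_tdc_step(theta, w, phi, reward, phi_next, gamma, alpha, beta, rho, lam):
    """One hypothetical l1-regularized, TDC-style update on a single off-policy sample.

    theta    : primary weights of the linear value function V(s) = theta . phi(s)
    w        : auxiliary weights used by the TDC correction term
    phi      : feature vector of the current state
    phi_next : feature vector of the next state
    rho      : importance-sampling ratio pi(a|s) / mu(a|s) for off-policy data
    lam      : l1 regularization strength
    """
    # Off-policy TD error, reweighted by the importance-sampling ratio.
    delta = rho * (reward + gamma * phi_next.dot(theta) - phi.dot(theta))

    # TDC-style two-timescale updates: w tracks the expected TD error along phi,
    # and theta moves along the corrected gradient direction.
    theta_grad = delta * phi - gamma * phi_next * phi.dot(w)
    w = w + beta * (delta - phi.dot(w)) * phi
    theta = theta + alpha * theta_grad

    # Proximal step: soft-thresholding induces sparsity from the l1 penalty.
    theta = soft_threshold(theta, alpha * lam)
    return theta, w
```

In this sketch the sparsity comes from applying the proximal operator of the l1 norm after each gradient-TD step; the paper's saddle-point formulation instead handles the non-smooth penalty directly inside a primal-dual first-order solver.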
