Incremental Truncated LSTD

Balancing computational efficiency and sample efficiency is an important goal in reinforcement learning. Temporal difference (TD) learning algorithms stochastically update the value function, with time complexity linear in the number of features, whereas least-squares temporal difference (LSTD) algorithms are sample efficient but can be quadratic in the number of features. In this work, we develop an efficient incremental low-rank LSTD(λ) algorithm that progresses towards the goal of better balancing computation and sample efficiency. The algorithm reduces the computation and storage complexity to the number of features times the chosen rank parameter, while summarizing past samples efficiently to nearly match the sample efficiency of LSTD. We derive a simulation bound on the solution given by the truncated low-rank approximation, illustrating a bias-variance trade-off that depends on the choice of rank. We demonstrate that the algorithm effectively balances computational complexity and sample efficiency for policy evaluation on a benchmark task and a high-dimensional energy allocation domain.
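
To make the role of the rank parameter concrete, below is a minimal sketch of the truncation idea: solve the LSTD(λ) system with a rank-k truncated SVD pseudo-inverse instead of a full inverse. This is only an illustrative batch version under stated assumptions, not the paper's incremental algorithm; it still forms the d x d matrix A explicitly, whereas the incremental algorithm maintains a low-rank summary directly to keep computation and storage near the number of features times the rank. The transition and feature interfaces (transitions, phi) are assumptions made for the example.

```python
import numpy as np

def truncated_lstd(transitions, phi, gamma=0.99, lam=0.9, rank=10):
    """Batch sketch of truncated LSTD(lambda).

    transitions: list of (s, r, s_next) tuples from one trajectory.
    phi: function mapping a state to a 1-D numpy feature vector.
    Both interfaces are hypothetical; they do not come from the paper.
    """
    d = phi(transitions[0][0]).shape[0]
    A = np.zeros((d, d))  # LSTD statistics, formed in full here; the
    b = np.zeros(d)       # incremental algorithm avoids the d x d matrix
    e = np.zeros(d)       # eligibility trace
    for s, r, s_next in transitions:
        f, f_next = phi(s), phi(s_next)
        e = gamma * lam * e + f
        A += np.outer(e, f - gamma * f_next)
        b += r * e
    # Solve A w = b with a rank-k truncated SVD pseudo-inverse: dropping
    # the small singular values regularizes the solution (the bias-variance
    # trade-off controlled by the rank parameter).
    U, sigma, Vt = np.linalg.svd(A)
    k = min(rank, int(np.sum(sigma > 1e-10)))  # skip near-zero singular values
    return Vt[:k].T @ ((U[:, :k].T @ b) / sigma[:k])
```

With the rank equal to the number of features (and a full-rank A), this reduces to the ordinary pseudo-inverse LSTD(λ) solution; smaller ranks act as a regularizer, trading bias for lower variance and lower cost.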
