Sparse value function approximation for reinforcement learning

A key component of many reinforcement learning (RL) algorithms is the approximation of the value function. The design and selection of features for approximation in RL is crucial and remains an active area of research. One approach to the feature selection problem is to apply sparsity-inducing techniques when learning the value function approximation; such sparse methods tend to select relevant features and ignore irrelevant ones, automating the feature selection process. This dissertation describes three contributions in the area of sparse value function approximation for reinforcement learning.

One way to obtain sparse linear approximations is to include in the objective function a penalty on the sum of the absolute values of the approximation weights. This L1 regularization approach was first applied to temporal difference learning in LARS-TD, a batch learning algorithm inspired by least angle regression (LARS). In our first contribution, we define an iterative update equation whose fixed point is the L1 regularized linear fixed point of LARS-TD. The iterative update gives rise naturally to an online stochastic approximation algorithm. We prove convergence of the online algorithm and show that the L1 regularized linear fixed point is an equilibrium fixed point of the algorithm. We demonstrate that the algorithm converges to this fixed point, yielding a sparse solution with modestly better performance than unregularized linear temporal difference learning.

Our second contribution extends LARS-TD to integrate policy optimization with sparse value learning. We extend the L1 regularized linear fixed point to include a maximum over policies, defining a new, "greedy" fixed point. The greedy fixed point adds a new invariant to the set that LARS-TD maintains as it traverses its homotopy path, giving rise to a new algorithm that integrates sparse value learning and policy optimization. We demonstrate that the new algorithm performs comparably to policy iteration using LARS-TD.

Finally, we consider another approach to sparse learning: a simple algorithm that greedily adds new features. Such algorithms share many of the good properties of L1 regularization methods while also being extremely efficient and, in some cases, admitting theoretical guarantees on recovery of the true form of a sparse target function from sampled data. We consider variants of orthogonal matching pursuit (OMP) applied to RL, analyze the resulting algorithms, and compare them experimentally with existing L1 regularized approaches. We show that sparse recovery fails in perhaps the most natural scenario in which one might hope to achieve it; however, one variant provides promising theoretical guarantees under certain assumptions on the feature dictionary, while another variant empirically outperforms prior methods in both approximation accuracy and efficiency on several benchmark problems.
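For concreteness, the L1 regularized linear fixed point underlying LARS-TD can be written as follows; the notation here is assumed for illustration and does not appear in the abstract: Φ and Φ' are feature matrices at sampled states and their successors, R is the vector of sampled rewards, γ is the discount factor, and β is the regularization parameter. A weight vector w is a fixed point when it reproduces itself under the L1 penalized regression onto its own one-step Bellman backup:

```latex
w \;=\; \operatorname*{argmin}_{u}\; \tfrac{1}{2}\,\bigl\| R + \gamma \Phi' w - \Phi u \bigr\|_2^2 \;+\; \beta\,\| u \|_1
```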
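To illustrate the flavor of an iterative scheme with this fixed point, here is a minimal sketch (not the dissertation's exact update; all names and step-size choices below are illustrative) that interleaves a standard linear TD(0) step with soft-thresholding, the proximal operator of the L1 penalty, which drives small weights to exactly zero:

```python
import numpy as np

def soft_threshold(x, tau):
    """Proximal operator of the L1 norm: shrinks entries toward zero,
    setting those smaller than tau in magnitude to exactly zero."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def sparse_td_step(w, phi, phi_next, reward, gamma, alpha, beta):
    """One online step: a semi-gradient linear TD(0) update followed by
    L1 shrinkage. A sketch of the general iterative-shrinkage idea only,
    not the dissertation's exact update equation. alpha is the step size,
    beta the L1 regularization strength."""
    td_error = reward + gamma * phi_next @ w - phi @ w  # one-step TD error
    w = w + alpha * td_error * phi                      # TD(0) step on the weights
    return soft_threshold(w, alpha * beta)              # proximal L1 step
```

Iterating such a step over sampled transitions yields sparse weight vectors; the dissertation's first contribution proves convergence of its online algorithm and shows the L1 regularized linear fixed point is an equilibrium of the update.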
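The greedy alternative can likewise be sketched. Below is a generic orthogonal matching pursuit routine for a least-squares target (an assumed illustration; the dissertation's RL variants apply the same greedy selection to TD-specific targets such as the Bellman residual):

```python
import numpy as np

def omp(Phi, y, k):
    """Orthogonal matching pursuit: greedily select up to k columns of Phi
    to approximate y. Generic least-squares version for illustration."""
    n, d = Phi.shape
    support = []
    residual = y.copy()
    w = np.zeros(d)
    for _ in range(k):
        # Pick the feature most correlated with the current residual.
        j = int(np.argmax(np.abs(Phi.T @ residual)))
        if j in support:  # residual carries no new information
            break
        support.append(j)
        # Re-fit all selected weights by least squares on the support.
        w_s, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)
        residual = y - Phi[:, support] @ w_s
        w = np.zeros(d)
        w[support] = w_s
    return w, support
```

Each iteration adds the single most correlated feature and then re-fits all selected weights, so the residual stays orthogonal to every chosen feature; this is the efficiency and recovery-guarantee structure the abstract's third contribution exploits.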
