Understanding the Learned Iterative Soft Thresholding Algorithm with Matrix Factorization

Sparse coding is a core building block in many data analysis and machine learning pipelines. It is typically solved with generic first-order optimization techniques, such as the Iterative Soft Thresholding Algorithm and its accelerated version (ISTA, FISTA). These methods are optimal in the class of first-order methods for non-smooth, convex functions, but they exploit neither the particular structure of the problem at hand nor the input data distribution. An acceleration using neural networks, coined LISTA, was proposed by Gregor and LeCun (2010), who showed empirically that one can achieve high-quality estimates in few iterations by appropriately modifying the parameters of the proximal splitting. In this paper we study the reasons for this acceleration. Our mathematical analysis reveals that it is related to a specific matrix factorization of the Gram kernel of the dictionary, which attempts to nearly diagonalize the kernel with a basis that produces a small perturbation of the $\ell_1$ ball. When this factorization succeeds, we prove that the resulting splitting algorithm enjoys an improved convergence bound with respect to the non-adaptive version. Moreover, our analysis shows that the conditions for acceleration occur mostly at the beginning of the iterative process, consistent with numerical experiments. We further validate our analysis by showing that adaptive acceleration fails on dictionaries for which this factorization does not exist.
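To fix notation for the splitting discussed above, here is a minimal NumPy sketch, written under the usual LISTA conventions rather than taken from this paper's experiments: plain ISTA for the lasso objective $\frac{1}{2}\|x - Dz\|_2^2 + \lambda\|z\|_1$, followed by one unrolled LISTA-style layer in which the two matrices and the threshold of the same splitting become trainable parameters. The names W_e, S and theta follow Gregor and LeCun (2010) and are illustrative.

    import numpy as np

    def soft_threshold(v, t):
        # Proximal operator of t * ||.||_1: shrink each entry toward zero by t.
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    def ista(x, D, lam, n_iter=100):
        # Plain ISTA for min_z 0.5 * ||x - D @ z||^2 + lam * ||z||_1.
        L = np.linalg.norm(D, ord=2) ** 2   # Lipschitz constant of the smooth part
        z = np.zeros(D.shape[1])
        for _ in range(n_iter):
            z = soft_threshold(z - D.T @ (D @ z - x) / L, lam / L)
        return z

    def lista_layer(x, z, W_e, S, theta):
        # One unrolled LISTA layer: the same proximal splitting as an ISTA step,
        # but with W_e, S and theta learned from data. Initializing
        # W_e = D.T / L, S = np.eye(D.shape[1]) - D.T @ D / L, theta = lam / L
        # recovers exactly one ISTA iteration.
        return soft_threshold(W_e @ x + S @ z, theta)

In this notation, the factorization studied in the paper concerns the Gram kernel $B = D^\top D$ that appears inside S; the sketch only shows where the learned parameters enter the iteration, not the authors' implementation.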

[1] Andrew Y. Ng et al., The Importance of Encoding Versus Training with Sparse Coding and Vector Quantization, 2011, ICML.

[2] Jean-Baptiste Hiriart-Urruty et al., How to regularize a difference of convex functions, 1991.

[3] Marc Teboulle et al., A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems, 2009, SIAM J. Imaging Sci.

[4] Sébastien Bubeck et al., Theory of Convex Optimization for Machine Learning, 2014, arXiv.

[5] Michael I. Jordan et al., Computational and statistical tradeoffs via convex relaxation, 2012, Proceedings of the National Academy of Sciences.

[6] S. Osher et al., Coordinate descent optimization for ℓ1 minimization with application to compressed sensing; a greedy algorithm, 2009.

[7] Guillermo Sapiro et al., Learning Efficient Structured Sparse Models, 2012, ICML.

[8] Yann LeCun et al., Learning Fast Approximations of Sparse Coding, 2010, ICML.

[9] Benjamin Recht et al., Sharp Time–Data Tradeoffs for Linear Inverse Problems, 2015, IEEE Transactions on Information Theory.

[10] Martin J. Wainwright et al., Randomized sketches for kernels: Fast and optimal non-parametric regression, 2015, arXiv.

[11] D. Song et al., ℓp-norm uniform distribution, 1997.

[12] Heinz H. Bauschke et al., Convex Analysis and Monotone Operator Theory in Hilbert Spaces, 2011, CMS Books in Mathematics.

[13] Bruno A. Olshausen et al., Learning Intermediate-Level Representations of Form and Motion from Natural Movies, 2012, Neural Computation.

[14] R. Tibshirani et al., Pathwise Coordinate Optimization, 2007, arXiv:0708.1485.

[15] Michael W. Mahoney et al., Fast Randomized Kernel Ridge Regression with Statistical Guarantees, 2015, NIPS.

[16] Yonina C. Eldar et al., Tradeoffs Between Convergence Speed and Reconstruction Accuracy in Inverse Problems, 2016, IEEE Transactions on Signal Processing.

[17] Yoram Singer et al., Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, 2011, J. Mach. Learn. Res.

[18] T. Hesterberg et al., Least angle and ℓ1 penalized regression: A review, 2008, arXiv:0802.0964.

[19] C. O'Brien, Statistical Learning with Sparsity: The Lasso and Generalizations, 2016.

[20] Guillermo Sapiro et al., Online Learning for Matrix Factorization and Sparse Coding, 2009, J. Mach. Learn. Res.