An improved analysis of the ER-SpUD dictionary learning algorithm

In "dictionary learning" we observe $Y = AX + E$ for some $Y\in\mathbb{R}^{n\times p}$, $A \in\mathbb{R}^{m\times n}$, and $X\in\mathbb{R}^{m\times p}$. The matrix $Y$ is observed, and $A, X, E$ are unknown. Here $E$ is "noise" of small norm, and $X$ is column-wise sparse. The matrix $A$ is referred to as a {\em dictionary}, and its columns as {\em atoms}. Then, given some small number $p$ of samples, i.e.\ columns of $Y$, the goal is to learn the dictionary $A$ up to small error, as well as $X$. The motivation is that in many applications data is expected to sparse when represented by atoms in the "right" dictionary $A$ (e.g.\ images in the Haar wavelet basis), and the goal is to learn $A$ from the data to then use it for other applications. Recently, [SWW12] proposed the dictionary learning algorithm ER-SpUD with provable guarantees when $E = 0$ and $m = n$. They showed if $X$ has independent entries with an expected $s$ non-zeroes per column for $1 \lesssim s \lesssim \sqrt{n}$, and with non-zero entries being subgaussian, then for $p\gtrsim n^2\log^2 n$ with high probability ER-SpUD outputs matrices $A', X'$ which equal $A, X$ up to permuting and scaling columns (resp.\ rows) of $A$ (resp.\ $X$). They conjectured $p\gtrsim n\log n$ suffices, which they showed was information theoretically necessary for {\em any} algorithm to succeed when $s \simeq 1$. Significant progress was later obtained in [LV15]. We show that for a slight variant of ER-SpUD, $p\gtrsim n\log(n/\delta)$ samples suffice for successful recovery with probability $1-\delta$. We also show that for the unmodified ER-SpUD, $p\gtrsim n^{1.99}$ samples are required even to learn $A, X$ with polynomially small success probability. This resolves the main conjecture of [SWW12], and contradicts the main result of [LV15], which claimed that $p\gtrsim n\log^4 n$ guarantees success whp.

[1] Kyle Luh, Van Vu. Random Matrices: l1 Concentration and Dictionary Learning with Few Samples. FOCS, 2015.

[2] Daniel A. Spielman, Huan Wang, John Wright. Exact Recovery of Sparsely-Used Dictionaries. COLT, 2012.

[3] Aditya Bhaskara et al. More Algorithms for Provable Dictionary Learning. arXiv, 2014.

[4] David Steurer et al. Dictionary Learning and Tensor Decomposition via the Sum-of-Squares Method. STOC, 2014.

[5] Gábor Lugosi et al. Concentration Inequalities: A Nonasymptotic Theory of Independence. 2013.

[6] Sjoerd Dirksen et al. Tail bounds via generic chaining. arXiv, 2013.

[7] Michael Elad et al. Image Denoising Via Sparse and Redundant Representations Over Learned Dictionaries. IEEE Transactions on Image Processing, 2006.

[8] Prateek Jain et al. Learning Sparsely Used Overcomplete Dictionaries. COLT, 2014.

[9] Alan M. Frieze et al. Learning linear transformations. FOCS, 1996.

[10] Jean Ponce et al. Sparse Modeling for Image and Vision Processing. Found. Trends Comput. Graph. Vis., 2014.

[11] Radoslaw Adamczak et al. A Note on the Sample Complexity of the Er-SpUD Algorithm by Spielman, Wang and Wright for Exact Recovery of Sparsely Used Dictionaries. J. Mach. Learn. Res., 2016.

[12] John Wright et al. Complete dictionary recovery over the sphere. SampTA, 2015.

[13] Guillermo Sapiro et al. Online Learning for Matrix Factorization and Sparse Coding. J. Mach. Learn. Res., 2009.

[14] Phong Q. Nguyen et al. Learning a Parallelepiped: Cryptanalysis of GGH and NTRU Signatures. Journal of Cryptology, 2009.

[15] M. Elad et al. K-SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation. IEEE Transactions on Signal Processing, 2006.

[16] M. Talagrand et al. Probability in Banach Spaces: Isoperimetry and Processes. 1991.

[17] Sanjeev Arora et al. Provable ICA with Unknown Gaussian Noise, and Implications for Gaussian Mixtures and Autoencoders. Algorithmica, 2012.

[18] Yong Xu et al. Sparse Representation for Brain Signal Processing: A tutorial on methods and applications. IEEE Signal Processing Magazine, 2014.

[19] Sanjeev Arora et al. New Algorithms for Learning Incoherent and Overcomplete Dictionaries. COLT, 2013.

[20] Rajat Raina et al. Self-taught learning: transfer learning from unlabeled data. ICML, 2007.

[21] M. Talagrand. Upper and Lower Bounds for Stochastic Processes: Modern Methods and Classical Problems. 2014.

[22] Mikhail Belkin et al. Blind Signal Separation in the Presence of Gaussian Noise. COLT, 2012.

[23] A. Bruckstein et al. K-SVD: An Algorithm for Designing of Overcomplete Dictionaries for Sparse Representation. 2005.

[24] Santosh S. Vempala et al. Fourier PCA and robust tensor decomposition. STOC, 2013.

[25] Santosh S. Vempala et al. Max vs Min: Tensor Decomposition and ICA with nearly Linear Sample Complexity. COLT, 2014.

[26] Michael Elad et al. Compression of facial images using the K-SVD algorithm. J. Vis. Commun. Image Represent., 2008.

[27] Guillermo Sapiro et al. Non-local sparse models for image restoration. ICCV, 2009.

[28] Guillermo Sapiro et al. Supervised Dictionary Learning. NIPS, 2008.