On the Minimax Risk of Dictionary Learning

We consider the problem of learning a dictionary matrix from a number of observed signals, which are assumed to be generated via a linear model with a common underlying dictionary. In particular, we derive lower bounds on the minimum achievable worst-case mean squared error (MSE), regardless of the computational complexity of the dictionary learning (DL) scheme. Casting DL as a classical (frequentist) estimation problem, we obtain these lower bounds by following an established information-theoretic approach to minimax estimation. The main contribution of this paper is the adaptation of these information-theoretic tools to the DL problem. We derive three lower bounds, each applying to a different generative model for the observed signals. The first bound requires only the existence of a covariance matrix of the (unknown) underlying coefficient vector. The second bound is obtained by specializing the first to sparse coefficient distributions under the assumption that the true dictionary satisfies the restricted isometry property; it bounds the worst-case MSE in terms of the signal-to-noise ratio (SNR). The third bound applies to a more restrictive subclass of coefficient distributions, requiring the non-zero coefficients to be Gaussian. Although its applicability is the most limited, it is the tightest of the three in the low-SNR regime. One particular use of our lower bounds is to derive necessary conditions on the number of observations (sample size) for DL to be feasible, i.e., for accurate DL schemes to possibly exist. By comparing these necessary conditions with sufficient conditions on the sample size under which a particular DL technique succeeds, we are able to characterize the regimes in which those algorithms are optimal in terms of required sample size.
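
For concreteness, the following is a generic sketch of the setting the abstract describes; the notation (\(\mathbf{y}_k\), \(\mathbf{D}\), \(\mathbf{x}_k\), \(\sigma\), \(N\), and the dictionary class \(\mathcal{D}\)) is assumed here rather than taken from the paper. The observations follow a linear generative model
\[
\mathbf{y}_k = \mathbf{D}\,\mathbf{x}_k + \mathbf{n}_k, \qquad k = 1,\dots,N,
\]
where \(\mathbf{D} \in \mathbb{R}^{m \times p}\) is the unknown dictionary, \(\mathbf{x}_k\) are the (possibly sparse) coefficient vectors, and \(\mathbf{n}_k \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})\) is additive noise. The worst-case MSE being lower bounded is the minimax risk
\[
\varepsilon^{*} \;=\; \inf_{\widehat{\mathbf{D}}} \; \sup_{\mathbf{D} \in \mathcal{D}} \; \mathbb{E}_{\mathbf{D}} \Big[ \big\| \widehat{\mathbf{D}}(\mathbf{y}_1,\dots,\mathbf{y}_N) - \mathbf{D} \big\|_{F}^{2} \Big],
\]
where the infimum ranges over all estimators, regardless of computational complexity. The information-theoretic approach mentioned in the abstract typically proceeds via a Fano-type argument: for any finite packing \(\{\mathbf{D}_1,\dots,\mathbf{D}_M\} \subset \mathcal{D}\) with pairwise distances \(\|\mathbf{D}_i - \mathbf{D}_j\|_{F} \ge 2\delta\),
\[
\varepsilon^{*} \;\ge\; \delta^{2} \left( 1 - \frac{I(\mathbf{Y}; J) + \log 2}{\log M} \right),
\]
where \(J\) is uniform over \(\{1,\dots,M\}\), \(\mathbf{Y} = (\mathbf{y}_1,\dots,\mathbf{y}_N)\), and \(I(\mathbf{Y}; J)\) is the mutual information between the observations and the index of the true dictionary. Bounding \(I(\mathbf{Y}; J)\) under each generative model is what yields the different lower bounds.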
