Subgradient Descent Learns Orthogonal Dictionaries

This paper concerns dictionary learning, i.e., sparse coding, a fundamental representation learning problem. We show that a subgradient descent algorithm, with random initialization, can provably recover orthogonal dictionaries on a natural nonsmooth, nonconvex $\ell_1$ minimization formulation of the problem, under mild statistical assumptions on the data. This is in contrast to previous provable methods that require either expensive computation or delicate initialization schemes. Our analysis develops several tools for characterizing landscapes of nonsmooth functions, which might be of independent interest for provable training of deep networks with nonsmooth activations (e.g., ReLU), among numerous other applications. Preliminary experiments corroborate our analysis and show that our algorithm works well empirically in recovering orthogonal dictionaries.
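Since the abstract names both the algorithm (subgradient descent from a random initialization) and the formulation (a nonsmooth, nonconvex $\ell_1$ minimization over the sphere), a minimal runnable sketch may help fix ideas. It assumes the formulation $\min_{\|q\|_2 = 1} \frac{1}{m}\|q^\top Y\|_1$ for recovering a single column of the orthogonal dictionary, handles the sphere constraint by projecting the subgradient onto the tangent space and renormalizing, and tests on a Bernoulli-Gaussian sparse-code model; the step-size schedule, iteration count, and all function and variable names are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def recover_dictionary_column(Y, n_iters=500, step0=0.1, seed=0):
    """Sketch of Riemannian subgradient descent for
        min_{||q||_2 = 1} (1/m) * ||q^T Y||_1,
    which aims to recover one column of an orthogonal dictionary
    when Y = A X with A orthogonal and X sparse (assumed data model)."""
    n, m = Y.shape
    rng = np.random.default_rng(seed)

    # Random initialization on the unit sphere.
    q = rng.standard_normal(n)
    q /= np.linalg.norm(q)

    for k in range(n_iters):
        # Euclidean subgradient of (1/m) * ||q^T Y||_1.
        g = Y @ np.sign(Y.T @ q) / m
        # Project onto the tangent space of the sphere at q.
        g_tan = g - (q @ g) * q
        # Diminishing step size (illustrative schedule, not the paper's).
        step = step0 / np.sqrt(k + 1)
        q = q - step * g_tan
        q /= np.linalg.norm(q)  # retract back to the sphere
    return q

if __name__ == "__main__":
    # Synthetic test: orthogonal dictionary A, Bernoulli-Gaussian sparse codes X.
    rng = np.random.default_rng(1)
    n, m, theta = 30, 5000, 0.1
    A, _ = np.linalg.qr(rng.standard_normal((n, n)))
    X = rng.standard_normal((n, m)) * (rng.random((n, m)) < theta)
    Y = A @ X

    q = recover_dictionary_column(Y)
    # q should align with some column of A up to sign, so this is near 1.
    print("max |<q, a_i>| =", np.max(np.abs(A.T @ q)))
```

In this sketch, each recovered direction corresponds to one dictionary column up to sign; recovering the full orthogonal dictionary would require repeating the procedure (e.g., with deflation or independent random restarts), which is omitted here for brevity.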
