A Relaxation Argument for Optimization in Neural Networks and Non-Convex Compressed Sensing

It has been observed in practical applications and in theoretical analysis that over-parametrization helps to find good minima in neural network training. Similarly, in this article we study widening and deepening neural networks by a relaxation argument, so that the enlarged networks are rich enough to run $r$ copies of parts of the original network in parallel, without necessarily achieving zero training error as in over-parametrized scenarios. The partial copies can be combined in $r^\theta$ possible ways, where $\theta$ denotes the layer width. The enlarged networks can therefore potentially achieve the best training error among $r^\theta$ random initializations, although it is not immediately clear whether this can be realized via gradient descent or similar training methods. The same construction can be applied to other optimization problems by introducing a similar layered structure. We apply this idea to non-convex compressed sensing, where we show that in some scenarios we can realize the $r^\theta$-fold increased chance of obtaining a global optimum by solving a convex optimization problem of dimension $r\theta$.
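To make the counting argument concrete, the following is a minimal numerical sketch under our own illustrative assumptions, not the construction analyzed in the paper: a single widened ReLU layer holds $r$ independently initialized copies of each of its $\theta$ neurons, a one-hot per-neuron selection recovers one of the $r^\theta$ discrete combinations, and relaxing the selections to convex weights gives a continuous parametrization with $r\theta$ mixing variables. The helper `widened_forward` and all dimensions below are hypothetical.

```python
# Minimal sketch (illustrative assumptions, not the paper's construction):
# widen one layer of width theta so it holds r independently initialized
# copies of every neuron, and combine the copies with per-neuron convex
# weights. One-hot weights select one copy per neuron, emulating any of the
# r**theta combinations; soft weights relax this to r*theta real variables.
import numpy as np

rng = np.random.default_rng(0)
theta, d_in, r = 5, 8, 3  # layer width, input dimension, number of copies

# r copies of a (d_in -> theta) layer; W[k, i] is copy k of neuron i
W = rng.normal(0.0, 1.0 / np.sqrt(d_in), size=(r, theta, d_in))

def widened_forward(x, A):
    """A: (theta, r) mixing weights, each row on the probability simplex."""
    copies = np.maximum(np.einsum('kij,j->ki', W, x), 0.0)  # (r, theta) ReLU outputs
    return np.einsum('ik,ki->i', A, copies)  # neuron i mixes its r copies

x = rng.normal(size=d_in)

# Hard selection: copy c[i] for neuron i -> one of r**theta sub-networks.
c = np.array([2, 0, 1, 1, 0])
A_hard = np.eye(r)[c]  # (theta, r) one-hot rows
y_hard = widened_forward(x, A_hard)

# Relaxed selection: soft convex weights, r*theta parameters in total.
A_soft = rng.random((theta, r))
A_soft /= A_soft.sum(axis=1, keepdims=True)
y_soft = widened_forward(x, A_soft)

print("discrete combinations:", r**theta, "| relaxed dimension:", r * theta)
print(y_hard, y_soft)
```

The hard and soft evaluations agree whenever each row of the mixing matrix is one-hot, which is the sense in which the relaxed parametrization contains all $r^\theta$ discrete choices while remaining a continuous object of dimension $r\theta$.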
