A phase transition for finding needles in nonlinear haystacks with LASSO artificial neural networks

To fit sparse linear associations, a LASSO sparsity-inducing penalty with a single hyperparameter provably allows the recovery of the important features (the needles) with high probability in certain regimes, even when the sample size is smaller than the dimension of the input vector (the haystack). More recently, learners known as artificial neural networks (ANNs) have shown great success in many machine learning tasks, in particular in fitting nonlinear associations. Small learning rates, the stochastic gradient descent algorithm, and large training sets help cope with the explosion in the number of parameters of deep neural networks. Yet few ANN learners have been developed and studied for finding needles in nonlinear haystacks. Driven by a single hyperparameter, our ANN learner exhibits, as in the sparse linear case, a phase transition in the probability of retrieving the needles, which we do not observe with other ANN learners. To select our penalty parameter, we generalize the universal threshold of Donoho and Johnstone (1994), a rule that is preferable to cross-validation, which is conservative (too many false detections) and computationally expensive. In the spirit of simulated annealing, we propose a warm-start, sparsity-inducing algorithm to solve the high-dimensional, non-convex and non-differentiable optimization problem. We perform precise Monte Carlo simulations to demonstrate the effectiveness of our approach.
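
To make the kind of objective concrete, below is a minimal sketch of an ℓ1-penalized ("LASSO") neural network of the type described above: a one-hidden-layer ReLU network whose input-layer weights carry an ℓ1 penalty governed by a single hyperparameter λ, fitted along a decreasing grid of λ values with warm starts. The PyTorch implementation, the proximal (soft-thresholding) handling of the non-differentiable term, and all names and settings (SparseInputNet, fit_lasso_ann, the learning rate, the λ grid) are illustrative assumptions, not the paper's algorithm or its quantile-universal-threshold selection rule.

```python
# Illustrative sketch (not the paper's implementation): an l1-penalized
# one-hidden-layer network trained by proximal gradient steps, with warm
# starts over a decreasing grid of penalty values lambda.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseInputNet(nn.Module):
    """One-hidden-layer ReLU network; sparsity is sought in the input-layer weights."""

    def __init__(self, p: int, hidden: int = 20):
        super().__init__()
        self.input_layer = nn.Linear(p, hidden)   # l1-penalized weights
        self.output_layer = nn.Linear(hidden, 1)  # unpenalized

    def forward(self, x):
        return self.output_layer(torch.relu(self.input_layer(x)))


def soft_threshold_(w: torch.Tensor, thresh: float) -> None:
    """In-place soft-thresholding: proximal operator of thresh * ||w||_1."""
    with torch.no_grad():
        w.copy_(torch.sign(w) * torch.clamp(w.abs() - thresh, min=0.0))


def fit_lasso_ann(x, y, lambdas, hidden=20, lr=1e-2, epochs=500):
    """Warm-start path: fit for each lambda (largest to smallest), reusing the weights."""
    p = x.shape[1]
    model = SparseInputNet(p, hidden)
    path = []
    for lam in sorted(lambdas, reverse=True):
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        for _ in range(epochs):
            opt.zero_grad()
            loss = F.mse_loss(model(x), y)  # smooth part of the objective
            loss.backward()
            opt.step()
            # proximal step handles the non-differentiable l1 term
            soft_threshold_(model.input_layer.weight, lr * lam)
        # a feature is "selected" if its column of input-layer weights is not all zero
        selected = (model.input_layer.weight.abs().sum(dim=0) > 0).nonzero().flatten()
        path.append((lam, selected.tolist()))
    return model, path


# Toy usage: n = 50 samples, p = 200 features, only the first 3 are needles.
if __name__ == "__main__":
    torch.manual_seed(0)
    n, p = 50, 200
    x = torch.randn(n, p)
    y = (x[:, 0] * x[:, 1] + torch.sin(x[:, 2])).unsqueeze(1) + 0.1 * torch.randn(n, 1)
    _, path = fit_lasso_ann(x, y, lambdas=[1.0, 0.3, 0.1, 0.03])
    for lam, sel in path:
        print(f"lambda={lam:<5} selected features: {sel}")
```

Warm-starting each fit from the solution at the previous, larger λ keeps the whole path cheap to compute and stabilizes the non-convex optimization; the paper's own warm-start scheme, in the spirit of simulated annealing, plays an analogous role, and its penalty parameter would be set by the generalized universal threshold rather than by scanning a grid.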

[1] Mykola Pechenizkiy, et al. Truly Sparse Neural Networks at Scale, 2021, arXiv.

[2] David L. Donoho, et al. Precise Undersampling Theorems, 2010, Proceedings of the IEEE.

[3] R. Tibshirani. Regression Shrinkage and Selection via the Lasso, 1996.

[4] Trevor Hastie, et al. Regularization Paths for Generalized Linear Models via Coordinate Descent, 2010, Journal of Statistical Software.

[5] Andrea Montanari, et al. Surprises in High-Dimensional Ridgeless Least Squares Interpolation, 2019, Annals of Statistics.

[6] Peter Bühlmann and Sara van de Geer. Statistics for High-Dimensional Data: Methods, Theory and Applications, 2011, Springer, Berlin, Heidelberg.

[7] Nitish Srivastava, et al. Dropout: a simple way to prevent neural networks from overfitting, 2014, Journal of Machine Learning Research.

[8] Pushmeet Kohli, et al. Memory Bounded Deep Convolutional Networks, 2014, arXiv.

[9] Wei-Yin Loh, et al. Classification and regression trees, 2011, WIREs Data Mining and Knowledge Discovery.

[10] Miguel Á. Carreira-Perpiñán, et al. "Learning-Compression" Algorithms for Neural Net Pruning, 2018, IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[11] Marc'Aurelio Ranzato, et al. Sparse Feature Learning for Deep Belief Networks, 2007, NIPS.

[12] I. Johnstone, et al. Wavelet Shrinkage: Asymptopia?, 1995.

[13] S. Frick, et al. Compressed Sensing, 2014, Computer Vision: A Reference Guide.

[14] M. Yuan, et al. Model selection and estimation in regression with grouped variables, 2006.

[15] D. L. Donoho and I. M. Johnstone. Ideal spatial adaptation by wavelet shrinkage, 1994, Biometrika.

[16] Wyeth W. Wasserman, et al. Deep Feature Selection: Theory and Application to Identify Enhancers and Promoters, 2015, RECOMB.

[17] Hong Chen, et al. Universal approximation to nonlinear operators by neural networks with arbitrary activation functions and its application to dynamical systems, 1995, IEEE Transactions on Neural Networks.

[18] F. Bach, et al. Optimization with Sparsity-Inducing Penalties, 2011, Foundations and Trends in Machine Learning.

[19] Peng Zhang, et al. Transformed ℓ1 Regularization for Learning Sparse Deep Neural Networks, 2019, Neural Networks.

[20] Helmut Bölcskei, et al. Deep Neural Network Approximation Theory, 2019, IEEE Transactions on Information Theory.

[21] Yoram Bresler, et al. Online Sparsifying Transform Learning, Part I: Algorithms, 2015, IEEE Journal of Selected Topics in Signal Processing.

[22] Levent Sagun, et al. Scaling description of generalization with number of parameters in deep learning, 2019, Journal of Statistical Mechanics: Theory and Experiment.

[23] Lorenzo Rosasco, et al. Why and when can deep-but not shallow-networks avoid the curse of dimensionality: A review, 2016, International Journal of Automation and Computing.

[24] N. Simon, et al. Sparse-Input Neural Networks for High-dimensional Nonparametric Regression and Classification, 2017, arXiv:1711.07592.

[25] Michael A. Saunders, et al. Atomic Decomposition by Basis Pursuit, 1998, SIAM Journal on Scientific Computing.

[26] Dimche Kostadinov, et al. Learning Overcomplete and Sparsifying Transform With Approximate and Exact Closed Form Solutions, 2018, 7th European Workshop on Visual Information Processing (EUVIP).

[27] Andrea Montanari, et al. The Generalization Error of Random Features Regression: Precise Asymptotics and the Double Descent Curve, 2019, Communications on Pure and Applied Mathematics.

[28] A. Belloni, et al. Square-Root Lasso: Pivotal Recovery of Sparse Signals via Conic Programming, 2010, arXiv:1009.5689.

[29] Geoffrey E. Hinton, et al. Learning representations by back-propagating errors, 1986, Nature.

[30] S. Sardy, et al. Quantile universal threshold, 2017.

[31] S. Sardy, et al. Model Selection With Lasso-Zero: Adding Straw to the Haystack to Better Find Needles, 2018, Journal of Computational and Graphical Statistics.

[32] Leo Breiman. Random Forests, 2001, Machine Learning.

[33] Alexandros Kalousis, et al. Regularising Non-linear Models Using Feature Side-information, 2017, ICML.

[34] Paul W. Holland. Covariance Stabilizing Transformations, 1973.

[35] Nitish Srivastava, et al. Improving neural networks by preventing co-adaptation of feature detectors, 2012, arXiv.

[36] Andrew M. Saxe, et al. High-dimensional dynamics of generalization error in neural networks, 2017, Neural Networks.

[37] Helmut Bölcskei, et al. Optimal Approximation with Sparsely Connected Deep Neural Networks, 2017, SIAM Journal on Mathematics of Data Science.

[38] Arthur E. Hoerl, et al. Ridge Regression: Biased Estimation for Nonorthogonal Problems, 2000, Technometrics.

[39] George Cybenko. Approximation by superpositions of a sigmoidal function, 1989, Mathematics of Control, Signals and Systems.

[40] Mao Ye, et al. Variable Selection via Penalized Neural Network: a Drop-Out-One Loss Approach, 2018, ICML.

[41] J. Friedman, et al. Projection Pursuit Regression, 1981.

[42] Guang Cheng, et al. Directional Pruning of Deep Neural Networks, 2020, NeurIPS.

[43] Erich Elsen, et al. The Difficulty of Training Sparse Neural Networks, 2019, arXiv.

[44] Andrew R. Barron. Universal approximation bounds for superpositions of a sigmoidal function, 1993, IEEE Transactions on Information Theory.

[45] Marc Teboulle, et al. A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems, 2009, SIAM Journal on Imaging Sciences.

[46] I. Johnstone, et al. Needles and straw in haystacks: Empirical Bayes estimates of possibly sparse sequences, 2004, arXiv:math/0410088.

[47] Rajat Raina, et al. Efficient sparse coding algorithms, 2006, NIPS.

[48] Andrea Montanari, et al. The Noise-Sensitivity Phase Transition in Compressed Sensing, 2010, IEEE Transactions on Information Theory.

[49] Emmanuel J. Candès, et al. Decoding by linear programming, 2005, IEEE Transactions on Information Theory.

[50] Peter Bühlmann. High-Dimensional Statistics with a View Toward Applications in Biology, 2014.