SHRIMP: Sparser Random Feature Models via Iterative Magnitude Pruning

Sparse shrunk additive models and sparse random feature models have been developed separately as methods to learn low-order functions, in which there are few interactions between variables, but neither offers computational efficiency. On the other hand, ℓ2-based shrunk additive models are efficient but do not offer feature selection, as the resulting coefficient vectors are dense. Inspired by the success of the iterative magnitude pruning technique in finding lottery tickets of neural networks, we propose a new method, Sparser Random Feature Models via IMP (SHRIMP), to efficiently fit high-dimensional data with inherent low-dimensional structure in the form of sparse variable dependencies. Our method can be viewed as a combined process to construct and find sparse lottery tickets for two-layer dense networks. We explain the observed benefit of SHRIMP through a refined analysis of the generalization error for thresholded Basis Pursuit and the resulting bounds on eigenvalues. From function approximation experiments on both synthetic data and real-world benchmark datasets, we show that SHRIMP obtains test accuracy better than or competitive with state-of-the-art sparse feature and additive methods such as SRFE-S, SSAM, and SALSA. Meanwhile, SHRIMP performs feature selection with low computational complexity and is robust to the pruning rate, indicating robustness in the structure of the obtained subnetworks. We gain insight into the lottery ticket hypothesis through SHRIMP by noting a correspondence between our model and weight/neuron subnetworks.
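
The abstract describes SHRIMP as iterative magnitude pruning applied to a random feature model. The following is a minimal sketch of that core idea, assuming Gaussian random Fourier features, a small ridge penalty for the refits, and a fixed per-round pruning rate; the function names (shrimp_imp, ridge_fit), the pruning schedule, and the hyperparameters are illustrative assumptions, not the paper's reference implementation.

```python
# A minimal sketch of iterative magnitude pruning (IMP) on a random feature
# regression model, in the spirit of SHRIMP. All names and settings here are
# illustrative assumptions.
import numpy as np

def random_fourier_features(X, omega):
    """Map inputs X (n x d) to real random Fourier features using the
    cosine/sine pair for each random weight vector omega_j."""
    Z = X @ omega.T                                            # (n, N) projections
    return np.concatenate([np.cos(Z), np.sin(Z)], axis=1) / np.sqrt(omega.shape[0])

def ridge_fit(A, y, lam=1e-6):
    """Least-squares fit with a small ridge term for numerical stability."""
    n_feat = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(n_feat), A.T @ y)

def shrimp_imp(X, y, n_features=512, prune_rate=0.5, n_rounds=5, seed=0):
    """Fit a dense random feature model, then repeatedly drop the features
    whose coefficients have the smallest magnitude and refit on the survivors."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    omega = rng.standard_normal((n_features, d))               # fixed random first-layer weights
    A = random_fourier_features(X, omega)
    active = np.arange(A.shape[1])                             # indices of surviving features

    for _ in range(n_rounds):
        c = ridge_fit(A[:, active], y)
        keep = max(1, int(len(active) * (1 - prune_rate)))
        order = np.argsort(np.abs(c))[::-1]                    # largest-magnitude coefficients first
        active = active[order[:keep]]                          # prune the rest

    c = ridge_fit(A[:, active], y)                             # final refit on the pruned feature set
    return omega, active, c
```

In this sketch, pruning acts on feature (neuron) indices of a two-layer network whose first layer is the fixed random weights, which mirrors the correspondence between the model and weight/neuron subnetworks noted in the abstract.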

[1] Zenglin Xu, et al. Simple and Efficient Multiple Kernel Learning by Group Lasso, 2010, ICML.

[2] Michael Carbin, et al. The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks, 2018, ICLR.

[3] Daniel Potts, et al. Approximation of High-Dimensional Periodic Functions with Fourier-Based Methods, 2021, SIAM J. Numer. Anal.

[4] Andrea Montanari, et al. The generalization error of max-margin linear classifiers: High-dimensional asymptotics in the overparametrized regime, 2019.

[5] J. Horowitz, et al. Variable Selection in Nonparametric Additive Models, 2010, Annals of Statistics.

[6] Andrew Zisserman, et al. Efficient additive kernels via explicit feature maps, 2010, IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7] Benjamin Recht, et al. Random Features for Large-Scale Kernel Machines, 2007, NIPS.

[8] Arthur Jacot, et al. Neural Tangent Kernel: Convergence and Generalization in Neural Networks, 2018, NeurIPS.

[9] Henryk Wozniakowski, et al. On decompositions of multivariate functions, 2009, Math. Comput.

[10] R. DeVore, et al. Approximation of Functions of Few Variables in High Dimensions, 2011.

[11] Heng Huang, et al. Sparse Shrunk Additive Models, 2020, ICML.

[12] Gilad Yehudai, et al. Proving the Lottery Ticket Hypothesis: Pruning is All You Need, 2020, ICML.

[13] Tong Zhang, et al. Learning Bounds for Kernel Regression Using Effective Data Dimensionality, 2005, Neural Computation.

[14] Y. Teh, et al. Lottery Tickets in Linear Models: An Analysis of Iterative Magnitude Pruning, 2020, arXiv.

[15] Shou-De Lin, et al. Sparse Random Feature Algorithm as Coordinate Descent in Hilbert Space, 2014, NIPS.

[16] Francis Bach, et al. On Lazy Training in Differentiable Programming, 2018, NeurIPS.

[17] Mila Nikolova, et al. Description of the Minimizers of Least Squares Regularized with ℓ0-norm. Uniqueness of the Global Minimizer, 2013, SIAM J. Imaging Sci.

[18] Daniel Potts, et al. Interpretable Approximation of High-Dimensional Data, 2021, SIAM J. Math. Data Sci.

[19] Kameron Decker Harris. Additive function approximation in the brain, 2019, arXiv.

[20] Ali Rahimi, et al. Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning, 2008, NIPS.

[21] T. Ishigami, et al. An importance quantification technique in uncertainty analysis for computer models, 1990, Proceedings of the First International Symposium on Uncertainty Modeling and Analysis.

[22] Holger Rauhut, et al. A Mathematical Introduction to Compressive Sensing, 2013, Applied and Numerical Harmonic Analysis.

[23] Zhenyu Liao, et al. A random matrix analysis of random Fourier features: beyond the Gaussian kernel, a precise phase transition, and the corresponding double descent, 2020, NeurIPS.

[24] Andreas Christmann, et al. Support vector machines, 2008, Data Mining and Knowledge Discovery Handbook.

[25] Yaoliang Yu, et al. Additive Approximations in High Dimensional Nonparametric Regression via the SALSA, 2016, ICML.

[26] Linan Zhang, et al. On the Convergence of the SINDy Algorithm, 2018, Multiscale Model. Simul.

[27] J. Lafferty, et al. Sparse additive models, 2007, arXiv:0711.4555.

[28] Arthur Jacot, et al. Implicit Regularization of Random Feature Models, 2020, ICML.

[29] Jian Sun, et al. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, 2015, IEEE International Conference on Computer Vision (ICCV).

[30] Ayca Ozcelikkale. Sparse Recovery With Non-Linear Fourier Features, 2020, ICASSP.

[31] Jason Yosinski, et al. Deconstructing Lottery Tickets: Zeros, Signs, and the Supermask, 2019, NeurIPS.

[32] Florent Krzakala, et al. Double Trouble in Double Descent: Bias and Variance(s) in the Lazy Regime, 2020, ICML.

[33] Siddharth Krishna Kumar, et al. On weight initialization in deep neural networks, 2017, arXiv.

[34] Colin Campbell, et al. Kernel methods: a survey of current techniques, 2002, Neurocomputing.

[35] Mikhail Belkin, et al. Two models of double descent for weak features, 2019, SIAM J. Math. Data Sci.

[36] Taiji Suzuki, et al. Generalization of Two-layer Neural Networks: An Asymptotic Viewpoint, 2020, ICLR.

[37] Ameya Velingker, et al. Random Fourier Features for Kernel Ridge Regression: Approximation Bounds and Statistical Guarantees, 2018, ICML.

[38] Ethem Alpaydin, et al. Multiple Kernel Learning Algorithms, 2011, J. Mach. Learn. Res.

[39] Jinjun Xiong, et al. Why Lottery Ticket Wins? A Theoretical Perspective of Sample Complexity on Pruned Neural Networks, 2021, arXiv.

[40] N. Aronszajn. Theory of Reproducing Kernels, 1950.

[41] Hayden Schaeffer, et al. Conditioning of Random Feature Matrices: Double Descent and Generalization Error, 2021, arXiv.

[42] Laurent Orseau, et al. Logarithmic Pruning is All You Need, 2020, NeurIPS.

[43] Ufuk Topcu, et al. Generalization bounds for sparse random feature expansions, 2021, Applied and Computational Harmonic Analysis.

[44] Ankit Pensia, et al. Optimal Lottery Tickets via SubsetSum: Logarithmic Over-Parameterization is Sufficient, 2020, NeurIPS.

[45] Xi Chen, et al. Group Sparse Additive Models, 2012, ICML.

[46] Zhangyang Wang, et al. Efficient Lottery Ticket Finding: Less Data is More, 2021, ICML.

[47] Ameet Talwalkar, et al. On the Impact of Kernel Approximation on Learning Accuracy, 2010, AISTATS.

[48] Francis R. Bach, et al. Consistency of the group Lasso and multiple kernel learning, 2007, J. Mach. Learn. Res.

[49] A. Rahimi, et al. Uniform approximation of functions with random bases, 2008, 46th Annual Allerton Conference on Communication, Control, and Computing.

[50] Tengyuan Liang, et al. Just Interpolate: Kernel "Ridgeless" Regression Can Generalize, 2018, The Annals of Statistics.

[51] Ali Farhadi, et al. What's Hidden in a Randomly Weighted Neural Network?, 2020, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[52] Ameet Talwalkar, et al. Sampling Techniques for the Nyström Method, 2009, AISTATS.

[53] Michael P. Friedlander, et al. Probing the Pareto Frontier for Basis Pursuit Solutions, 2008, SIAM J. Sci. Comput.

[54] Andrea Montanari, et al. Surprises in High-Dimensional Ridgeless Least Squares Interpolation, 2019, Annals of Statistics.