Training Sparse Neural Network by Constraining Synaptic Weight on Unit Lp Sphere

Sparse deep neural networks have shown advantages over dense models, with fewer parameters and higher computational efficiency. Here we demonstrate that constraining the synaptic weights on the unit Lp-sphere enables flexible control of the sparsity through p and improves the generalization ability of neural networks. First, to optimize the synaptic weights constrained on the unit Lp-sphere, the parameter optimization algorithm Lp-spherical gradient descent (LpSGD) is derived from the augmented empirical risk minimization condition and is theoretically proved to converge. To understand how p affects Hoyer's sparsity, the expectation of Hoyer's sparsity under the hypothesis of a gamma distribution is derived, and the predictions are verified at various p under different conditions. In addition, "semi-pruning" and threshold adaptation are designed for topology evolution to effectively screen out important connections and drive the networks from the initial sparsity to the expected sparsity. Our approach is validated by experiments on benchmark datasets covering a wide range of domains, and the theoretical analysis paves the way for future work on training sparse neural networks with constrained optimization.
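The abstract does not reproduce the LpSGD update rule or the derivation from the augmented empirical risk minimization condition. As a minimal sketch of the two quantities it refers to, the snippet below (function names, shapes, and the plain renormalization step are illustrative assumptions, not the authors' implementation) rescales a weight vector onto the unit Lp-sphere after a gradient step and evaluates Hoyer's sparsity measure, which ranges from 0 for a uniform vector to 1 for a one-hot vector.

```python
import numpy as np

def project_to_unit_lp_sphere(w, p=1.0, eps=1e-12):
    """Rescale w so that its Lp norm equals 1.

    Illustrative renormalization only; the paper's LpSGD derives its update
    from the augmented ERM condition, which is not reproduced here.
    """
    norm = np.sum(np.abs(w) ** p) ** (1.0 / p)
    return w / (norm + eps)

def hoyer_sparsity(w, eps=1e-12):
    """Hoyer's sparsity: (sqrt(n) - ||w||_1 / ||w||_2) / (sqrt(n) - 1)."""
    n = w.size
    l1 = np.sum(np.abs(w))
    l2 = np.sqrt(np.sum(w ** 2))
    return (np.sqrt(n) - l1 / (l2 + eps)) / (np.sqrt(n) - 1.0)

# Toy usage: one gradient step followed by projection back onto the Lp-sphere.
rng = np.random.default_rng(0)
w = project_to_unit_lp_sphere(rng.normal(size=256), p=1.5)
grad = rng.normal(size=256)          # placeholder gradient
w = project_to_unit_lp_sphere(w - 0.1 * grad, p=1.5)
print(f"Hoyer sparsity at p=1.5: {hoyer_sparsity(w):.3f}")
```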
