Convergence analyses on sparse feedforward neural networks via group lasso regularization

In this paper, we propose a new variant of feedforward neural network training for a class of nonsmooth optimization problems. The penalty term of the presented networks stems from the Group Lasso method, which selects hidden-layer variables in a grouped manner. To handle the non-differentiability of the original penalty term (an ℓ1-ℓ2 norm) and to avoid oscillations, smoothing techniques are used to approximate the objective function. Training samples are assumed to be supplied to the network incrementally, that is, in a fixed order within each cycle. Under suitable assumptions on the learning rate, the penalization coefficient, and the smoothing parameter, we prove weak and strong convergence of the training process for the smoothing neural networks: the gradient of the smoothed error function approaches zero and the weight sequence converges to a fixed point, respectively. We show how the smoothing approximation parameter can be updated during training so that the procedure converges to a Clarke stationary point of the original optimization problem. In addition, we prove that the original nonsmooth algorithm with the ℓ1-ℓ2 norm penalty converges to the same optimal solution as the corresponding smoothed algorithm. Numerical simulations demonstrate the convergence and effectiveness of the proposed training algorithm.
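To make the training scheme described above concrete, the sketch below trains a single-hidden-layer network cyclically (samples visited in a fixed order each epoch) with a smoothed group lasso penalty on the incoming weights of each hidden unit. This is an illustrative reconstruction rather than the paper's exact algorithm: the smoothing sqrt(||w_j||^2 + mu^2), the 1/(1+epoch) schedules for the learning rate and the smoothing parameter, the grouping by incoming weights only, and the names smoothed_group_norm and train_cyclic are all assumptions made for this example. Driving the smoothing parameter mu toward zero makes the smoothed penalty approach the original ℓ1-ℓ2 penalty, which is the mechanism by which such schemes can reach a Clarke stationary point of the original nonsmooth problem.

```python
import numpy as np


def smoothed_group_norm(W, mu):
    """Smooth approximation of the group lasso penalty sum_j ||w_j||_2.

    Each column of W (the incoming weights of one hidden unit) is one group.
    ||w_j||_2 is replaced by sqrt(||w_j||_2^2 + mu^2), which is differentiable
    everywhere and tends to the original penalty as mu -> 0.
    """
    return np.sum(np.sqrt(np.sum(W ** 2, axis=0) + mu ** 2))


def train_cyclic(X, y, n_hidden=10, epochs=200, eta0=0.1, lam=1e-3, seed=0):
    """Cyclic (fixed-order) gradient training with a smoothed group lasso penalty."""
    rng = np.random.default_rng(seed)
    n_in = X.shape[1]
    W = rng.normal(scale=0.5, size=(n_in, n_hidden))  # input-to-hidden weights
    v = rng.normal(scale=0.5, size=n_hidden)          # hidden-to-output weights

    for epoch in range(epochs):
        eta = eta0 / (1.0 + epoch)  # diminishing learning rate (assumed schedule)
        mu = 1.0 / (1.0 + epoch)    # shrinking smoothing parameter (assumed schedule)
        for x_i, y_i in zip(X, y):  # samples visited in the same fixed order each cycle
            h = np.tanh(W.T @ x_i)          # hidden activations
            err = v @ h - y_i               # output error for this sample
            # gradient of the half squared error 0.5 * err**2
            grad_v = err * h
            grad_W = err * np.outer(x_i, v * (1.0 - h ** 2))
            # gradient of the smoothed penalty: w_j / sqrt(||w_j||^2 + mu^2) per group
            grad_W += lam * W / np.sqrt(np.sum(W ** 2, axis=0) + mu ** 2)
            v -= eta * grad_v
            W -= eta * grad_W

        if epoch % 50 == 0:  # monitor the smoothed objective
            pred = np.tanh(X @ W) @ v
            obj = 0.5 * np.mean((pred - y) ** 2) + lam * smoothed_group_norm(W, mu)
            print(f"epoch {epoch:4d}  smoothed objective {obj:.4f}")
    return W, v


# Toy usage on a synthetic regression problem; hidden units whose incoming-weight
# norms shrink toward zero are the candidates the grouped penalty prunes away.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = np.sin(X[:, 0]) + 0.1 * X[:, 1]
W, v = train_cyclic(X, y)
```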
