A Unified Framework for Training Neural Networks

The lack of mathematical tractability of Deep Neural Networks (DNNs) has hindered progress towards a unified convergence analysis of training algorithms in the general setting. We propose a unified optimization framework for training different types of DNNs, and we establish its convergence for arbitrary loss, activation, and regularization functions, assumed to be smooth. We show that the framework generalizes well-known first- and second-order training methods, which allows us to establish the convergence of these methods for various DNN architectures and learning tasks as special cases of our approach. We discuss applications to training various DNN architectures (e.g., feed-forward, convolutional, and linear networks) for regression and classification tasks.
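To make the "first- and second-order methods as special cases" claim concrete, the following is a minimal, hypothetical sketch of training by successive surrogate minimization: at each iteration the (nonconvex) training loss is replaced by a simpler model built around the current iterate, and minimizing a quadratic upper bound recovers a gradient step while minimizing a damped second-order model recovers a Newton-type step. The function names and loop structure below are illustrative assumptions, not the paper's actual algorithm or API.

```python
import numpy as np

def quadratic_surrogate_step(w, grad, lipschitz):
    """Minimize the quadratic upper bound
       L(w_k) + <grad, w - w_k> + (lipschitz / 2) * ||w - w_k||^2,
    whose closed-form minimizer is a gradient step, so first-order
    methods appear as one choice of surrogate."""
    return w - grad / lipschitz

def newton_surrogate_step(w, grad, hessian, damping=1e-3):
    """Minimize a second-order model of the loss around w_k; with a
    damped Hessian this recovers a regularized Newton step."""
    h = hessian + damping * np.eye(len(w))
    return w - np.linalg.solve(h, grad)

def train(w0, loss_grad, loss_hess=None, lipschitz=1.0, steps=100, tol=1e-6):
    """Generic successive-minimization loop: at each iteration build a
    surrogate (first- or second-order) around the current iterate and
    minimize it exactly."""
    w = np.asarray(w0, dtype=float)
    for _ in range(steps):
        g = loss_grad(w)
        if np.linalg.norm(g) < tol:  # approximate stationarity reached
            break
        if loss_hess is not None:
            w = newton_surrogate_step(w, g, loss_hess(w))
        else:
            w = quadratic_surrogate_step(w, g, lipschitz)
    return w
```

For instance, with loss_grad = lambda w: A.T @ (A @ w - b) and lipschitz set to the largest eigenvalue of A.T @ A, the loop reduces to plain gradient descent on a least-squares problem; supplying loss_hess = lambda w: A.T @ A instead yields the Newton-type variant.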
