A Unified Framework for Training Neural Networks

The lack of mathematical tractability of Deep Neural Networks (DNNs) has hindered progress towards a unified convergence analysis of training algorithms in the general setting. We propose a unified optimization framework for training different types of DNNs, and we establish its convergence for arbitrary loss, activation, and regularization functions, assumed to be smooth. We show that the framework generalizes well-known first- and second-order training methods, which allows us to establish the convergence of these methods for various DNN architectures and learning tasks as special cases of our approach. We discuss applications to training various DNN architectures (e.g., feed-forward, convolutional, and linear networks) for regression and classification tasks.
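To make the "first- and second-order methods as special cases" claim concrete, the following is a minimal, hypothetical sketch of training by successive surrogate minimization: at each iteration the (nonconvex) training loss is replaced by a simpler model built around the current iterate, and minimizing a quadratic upper bound recovers a gradient step while minimizing a damped second-order model recovers a Newton-type step. The function names and loop structure below are illustrative assumptions, not the paper's actual algorithm or API.

```python
import numpy as np

def quadratic_surrogate_step(w, grad, lipschitz):
    """Minimize the quadratic upper bound
       L(w_k) + <grad, w - w_k> + (lipschitz / 2) * ||w - w_k||^2,
    whose closed-form minimizer is a gradient step, so first-order
    methods appear as one choice of surrogate."""
    return w - grad / lipschitz

def newton_surrogate_step(w, grad, hessian, damping=1e-3):
    """Minimize a second-order model of the loss around w_k; with a
    damped Hessian this recovers a regularized Newton step."""
    h = hessian + damping * np.eye(len(w))
    return w - np.linalg.solve(h, grad)

def train(w0, loss_grad, loss_hess=None, lipschitz=1.0, steps=100, tol=1e-6):
    """Generic successive-minimization loop: at each iteration build a
    surrogate (first- or second-order) around the current iterate and
    minimize it exactly."""
    w = np.asarray(w0, dtype=float)
    for _ in range(steps):
        g = loss_grad(w)
        if np.linalg.norm(g) < tol:  # approximate stationarity reached
            break
        if loss_hess is not None:
            w = newton_surrogate_step(w, g, loss_hess(w))
        else:
            w = quadratic_surrogate_step(w, g, lipschitz)
    return w
```

For instance, with loss_grad = lambda w: A.T @ (A @ w - b) and lipschitz set to the largest eigenvalue of A.T @ A, the loop reduces to plain gradient descent on a least-squares problem; supplying loss_hess = lambda w: A.T @ A instead yields the Newton-type variant.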
