On ADMM in Deep Learning: Convergence and Saturation-Avoidance

In this paper, we develop an alternating direction method of multipliers (ADMM) for training deep neural networks with sigmoid-type activation functions (called the \textit{sigmoid-ADMM pair}), mainly motivated by the gradient-free nature of ADMM, which avoids the saturation of sigmoid-type activations, and by the advantages of deep neural networks with sigmoid-type activations (called deep sigmoid nets) over their rectified linear unit (ReLU) counterparts (called deep ReLU nets) in terms of approximation. In particular, we prove that the approximation capability of deep sigmoid nets is no worse than that of deep ReLU nets by showing that the ReLU activation function can be well approximated by deep sigmoid nets with two hidden layers and finitely many free parameters, but not vice versa. We also establish the global convergence of the proposed ADMM for the nonlinearly constrained formulation of deep sigmoid net training to a Karush-Kuhn-Tucker (KKT) point at a rate of order ${\cal O}(1/k)$. Compared with the widely used stochastic gradient descent (SGD) algorithm for training deep ReLU nets (called the ReLU-SGD pair), the proposed sigmoid-ADMM pair is practically stable with respect to the algorithmic hyperparameters, including the learning rate, the initialization scheme, and the pre-processing of the input data. Moreover, we find that, in approximating and learning simple but important functions, the proposed sigmoid-ADMM pair numerically outperforms the ReLU-SGD pair.
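For context, the following is a minimal sketch of the kind of nonlinearly constrained (lifted) formulation to which ADMM-type training methods are applied; the notation is illustrative and need not match the paper's exact formulation. With input $V_0 = X$, labels $Y$, weights $\{W_i\}$, auxiliary layer outputs $\{V_i\}$, and the sigmoid $\sigma$ applied entrywise,
\[
\min_{\{W_i\},\{V_i\}} \ \mathcal{L}\big(W_N V_{N-1}, Y\big)
\quad \text{s.t.} \quad V_i = \sigma(W_i V_{i-1}), \quad i = 1,\dots,N-1,
\]
and ADMM performs block updates on the $W_i$, the $V_i$, and the dual variables $\Lambda_i$ of the associated augmented Lagrangian
\[
\mathcal{L}_\beta = \mathcal{L}\big(W_N V_{N-1}, Y\big) + \sum_{i=1}^{N-1}\Big( \big\langle \Lambda_i,\, V_i - \sigma(W_i V_{i-1})\big\rangle + \tfrac{\beta}{2}\,\big\| V_i - \sigma(W_i V_{i-1}) \big\|_F^2 \Big),
\]
where $\beta > 0$ is a penalty parameter. Because the updates act on one block at a time, no gradient is backpropagated through a long chain of saturating activations.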

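To make the update cycle concrete, below is a hypothetical toy sketch in Python for a one-hidden-layer sigmoid net; it is not the paper's algorithm, and the penalty parameter beta, the step size lr_w1, the ridge term eps, and the gradient-step W1 update are illustrative choices.

```python
# Toy ADMM-style training of a one-hidden-layer sigmoid net (illustrative sketch):
#   min_{W1,W2,V1} 0.5*||W2 V1 - Y||^2   s.t.  V1 = sigmoid(W1 X)
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

d, h, m, n = 5, 10, 1, 200                      # input dim, hidden width, output dim, samples
X = rng.standard_normal((d, n))
Y = np.sin(X.sum(axis=0, keepdims=True))        # toy regression target

W1 = 0.1 * rng.standard_normal((h, d))
W2 = 0.1 * rng.standard_normal((m, h))
V1 = sigmoid(W1 @ X)                            # auxiliary variable for the hidden output
Lam = np.zeros((h, n))                          # dual variable for V1 - sigmoid(W1 X) = 0
beta, eps, lr_w1 = 1.0, 1e-6, 0.5

for k in range(200):
    S = sigmoid(W1 @ X)
    # W2-block: ridge-regularized least squares on ||W2 V1 - Y||^2
    W2 = Y @ V1.T @ np.linalg.inv(V1 @ V1.T + eps * np.eye(h))
    # V1-block: closed-form minimizer of the augmented Lagrangian in V1
    V1 = np.linalg.solve(W2.T @ W2 + beta * np.eye(h),
                         W2.T @ Y - Lam + beta * S)
    # W1-block: a few gradient steps on the nonconvex penalty term (a simplification)
    for _ in range(5):
        S = sigmoid(W1 @ X)
        grad = -((Lam + beta * (V1 - S)) * S * (1.0 - S)) @ X.T / n   # sample-averaged
        W1 -= lr_w1 * grad
    # dual ascent on the nonlinear constraint
    S = sigmoid(W1 @ X)
    Lam += beta * (V1 - S)

print("fit error:", np.linalg.norm(W2 @ sigmoid(W1 @ X) - Y) / np.sqrt(n))
print("constraint residual:", np.linalg.norm(V1 - sigmoid(W1 @ X)) / np.sqrt(n))
```

Note that only the W1 block touches the sigmoid, and only through a local residual on a single layer; the W2 and V1 blocks are solved in closed form, which is the gradient-free behavior alluded to in the abstract.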