Effect of Depth and Width on Local Minima in Deep Learning

In this paper, we analyze the effects of depth and width on the quality of local minima of deep neural networks, without the strong overparameterization and simplification assumptions common in the literature. For deep nonlinear neural networks with the squared loss, and without any simplification assumption, we theoretically show that the quality of local minima tends to improve toward the global minimum value as depth and width increase. Furthermore, with a locally induced structure on deep nonlinear neural networks, the values of local minima are proven to be no worse than the globally optimal values of the corresponding classical machine learning models. We support these theoretical observations empirically on a synthetic data set as well as the MNIST, CIFAR-10, and SVHN data sets. Compared with previous studies that rely on strong overparameterization assumptions, the results in this paper do not require overparameterization and instead exhibit the gradual effects of overparameterization as consequences of more general results.
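As a rough illustration of the kind of comparison described above (not the authors' experimental protocol), the following Python sketch trains fully connected ReLU networks of increasing depth and width with the squared loss on a synthetic regression problem and compares the training loss reached by gradient descent against the global minimum of linear least squares, standing in for a "corresponding classical model." The architecture, synthetic target, and all hyperparameters are assumptions chosen only for illustration.

```python
# Illustrative sketch: does the training loss found by gradient descent
# approach (or beat) the global minimum of a classical linear model as
# depth and width grow? All settings below are assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic data: inputs in R^10, targets from a random nonlinear function.
n, d = 512, 10
X = torch.randn(n, d)
y = torch.sin(X @ torch.randn(d, 1)) + 0.1 * torch.randn(n, 1)

def make_net(depth, width):
    # Fully connected ReLU network with `depth` hidden layers of size `width`.
    layers, in_dim = [], d
    for _ in range(depth):
        layers += [nn.Linear(in_dim, width), nn.ReLU()]
        in_dim = width
    layers.append(nn.Linear(in_dim, 1))
    return nn.Sequential(*layers)

def train(net, steps=2000, lr=1e-2):
    # Plain gradient descent on the mean squared (training) loss.
    opt = torch.optim.SGD(net.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((net(X) - y) ** 2).mean()
        loss.backward()
        opt.step()
    return loss.item()

# Global minimum of ordinary least squares on the same data (classical baseline).
X1 = torch.cat([X, torch.ones(n, 1)], dim=1)
beta = torch.linalg.lstsq(X1, y).solution
ols_loss = ((X1 @ beta - y) ** 2).mean().item()
print(f"linear least-squares global minimum: {ols_loss:.4f}")

# Training losses tend to move toward (and below) the classical baseline
# as depth and width increase, in the spirit of the theoretical claim.
for depth in (1, 2, 4):
    for width in (8, 32, 128):
        final = train(make_net(depth, width))
        print(f"depth={depth:2d} width={width:4d} train MSE={final:.4f}")
```

The exact numbers depend on the random seed and optimizer settings; the sketch is only meant to show the shape of the depth/width comparison, not to reproduce the paper's experiments.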
