Depth with Nonlinearity Creates No Bad Local Minima in ResNets

In this paper, we prove that depth with nonlinearity creates no bad local minima in a class of arbitrarily deep ResNets with arbitrary nonlinear activation functions, in the sense that the value of every local minimum is no worse than the global minimum value of the corresponding classical machine-learning model, and is guaranteed to improve further via the residual representations. As a result, this paper provides an affirmative answer to an open question posed in a paper presented at the 2018 conference on Neural Information Processing Systems (NeurIPS 2018). These results advance the optimization theory of deep learning only for ResNets and not for other network architectures.
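
To make the claimed relationship concrete, the following is a minimal sketch of the statement's general shape rather than the paper's exact theorem; the architecture, the symbols \(W, U_l, V_l, \sigma, h_l, \theta\), and the reduction to the classical model below are illustrative assumptions, and the paper's precise model class and conditions may differ.

\[
\hat{y}(x) \;=\; W\, h_H(x), \qquad h_{l+1}(x) \;=\; h_l(x) + V_l\, \sigma\!\big(U_l\, h_l(x)\big), \qquad h_0(x) = x,
\]
\[
\text{and every local minimum } \theta^{\star} \text{ of the training loss } L(\theta) \text{ is claimed to satisfy} \quad L(\theta^{\star}) \;\le\; \min_{W'} L_{\mathrm{classical}}(W'),
\]

where \(L_{\mathrm{classical}}\) is the loss of the corresponding classical model, here taken to be the linear (basis-function) predictor recovered by zeroing every residual branch (\(V_l = 0\)) so that each block reduces to the identity. Under these assumptions, it is immediate that the ResNet's global minimum is at least as good as the classical model's, since the former can represent the latter; the abstract's stronger claim is that this guarantee holds at every local minimum, not only the global one, and that the residual representations can only improve the value further.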

References

[1] Lorenzo Rosasco et al., Why and when can deep-but not shallow-networks avoid the curse of dimensionality: A review, 2016, International Journal of Automation and Computing.

[2] Allan Pinkus et al., Multilayer Feedforward Networks with a Non-Polynomial Activation Function Can Approximate Any Function, 1991, Neural Networks.

[3] Jason D. Lee et al., On the Power of Over-parametrization in Neural Networks with Quadratic Activation, 2018, ICML.

[4] Yuanzhi Li et al., Convergence Analysis of Two-layer Neural Networks with ReLU Activation, 2017, NIPS.

[5] Matthias Hein et al., Optimization Landscape and Expressivity of Deep CNNs, 2017, ICML.

[6] Philip M. Long et al., Gradient Descent with Identity Initialization Efficiently Learns Positive-Definite Linear Transformations by Deep Residual Networks, 2018, Neural Computation.

[7] Amir Globerson et al., Globally Optimal Gradient Descent for a ConvNet with Gaussian Inputs, 2017, ICML.

[8] Matus Telgarsky, Benefits of Depth in Neural Networks, 2016, COLT.

[9] Tengyu Ma et al., On the optimization landscape of tensor decompositions, 2017, Mathematical Programming.

[10] Ronald L. Rivest et al., Training a 3-node neural network is NP-complete, 1988, COLT.

[11] Anima Anandkumar et al., Provable Methods for Training Neural Networks with Sparse Connectivity, 2014, ICLR.

[12] Kenji Kawaguchi, Deep Learning without Poor Local Minima, 2016, NIPS.

[13] Inderjit S. Dhillon et al., Recovery Guarantees for One-hidden-layer Neural Networks, 2017, ICML.

[14] Tengyu Ma et al., Identity Matters in Deep Learning, 2016, ICLR.

[15] Adam R. Klivans et al., Learning Depth-Three Neural Networks in Polynomial Time, 2017, arXiv.

[16] Roi Livni et al., On the Computational Efficiency of Training Neural Networks, 2014, NIPS.

[17] Elad Hoffer et al., Exponentially vanishing sub-optimal local minima in multilayer neural networks, 2017, ICLR.

[18] Thomas Laurent et al., Deep Linear Networks with Arbitrary Loss: All Local Minima Are Global, 2017, ICML.

[19] Leslie Pack Kaelbling et al., Bayesian Optimization with Exponential Convergence, 2015, NIPS.

[20] Andreas Stolcke et al., The Microsoft 2017 Conversational Speech Recognition System, 2018, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[21] Jian Sun et al., Identity Mappings in Deep Residual Networks, 2016, ECCV.

[22] Alexandr Andoni et al., Learning Polynomials with Neural Networks, 2014, ICML.

[23] Matthias Hein et al., The Loss Surface of Deep and Wide Neural Networks, 2017, ICML.

[24] Surya Ganguli et al., Identifying and attacking the saddle point problem in high-dimensional non-convex optimization, 2014, NIPS.

[25] Kyoung Mu Lee et al., Accurate Image Super-Resolution Using Very Deep Convolutional Networks, 2016, IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26] Katta G. Murty et al., Some NP-complete problems in quadratic and nonlinear programming, 1987, Mathematical Programming.

[27] Dimitri P. Bertsekas, Nonlinear Programming, 1995.

[28] Razvan Pascanu et al., On the number of response regions of deep feed forward networks with piece-wise linear activations, 2013, arXiv:1312.6098.

[29] Ohad Shamir, Are ResNets Provably Better than Linear Predictors?, 2018, NeurIPS.

[30] Zhuowen Tu et al., Aggregated Residual Transformations for Deep Neural Networks, 2017, IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31] Razvan Pascanu et al., On the number of inference regions of deep feed forward networks with piece-wise linear activations, 2013, ICLR.

[32] Yann LeCun et al., The Loss Surfaces of Multilayer Networks, 2014, AISTATS.

[33] Jian Sun et al., Deep Residual Learning for Image Recognition, 2016, IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34] Mahdi Soltanolkotabi, Learning ReLUs via Gradient Descent, 2017, NIPS.

[35] Yu Maruyama et al., Global Continuous Optimization with Error Bound and Fast Convergence, 2016, Journal of Artificial Intelligence Research.

[36] Tengyu Ma et al., Learning One-hidden-layer Neural Networks with Landscape Design, 2017, ICLR.

[37] Razvan Pascanu et al., On the Number of Linear Regions of Deep Neural Networks, 2014, NIPS.

[38] Andrew R. Barron, Universal approximation bounds for superpositions of a sigmoidal function, 1993, IEEE Transactions on Information Theory.