Improved Linear Convergence of Training CNNs With Generalizability Guarantees: A One-Hidden-Layer Case

We analyze the learning problem of one-hidden-layer nonoverlapping convolutional neural networks with the rectified linear unit (ReLU) activation function from the perspective of model estimation. The training outputs are assumed to be generated by the neural network with unknown ground-truth parameters plus additive noise, and the objective is to estimate the model parameters by minimizing a nonconvex squared loss function of the training data. Assuming that the training set contains a finite number of samples drawn from the Gaussian distribution, we prove that the accelerated gradient descent (GD) algorithm with a proper initialization converges to the ground-truth parameters (up to the noise level) at a linear rate, even though the learning problem is nonconvex. Moreover, the convergence rate is proved to be faster than that of vanilla GD. The initialization can be achieved by the existing tensor initialization method. In contrast to existing works that assume an infinite number of samples, we theoretically establish the sample complexity, i.e., the required number of training samples. Although the neural network considered here is not deep, this is the first work to show that accelerated GD algorithms can find the global optimizer of the nonconvex learning problem of neural networks. It is also the first to characterize the sample complexity of gradient-based methods in learning convolutional neural networks with the nonsmooth ReLU activation function, and it provides the tightest bound to date on the estimation error with respect to the output noise.

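To make the setup concrete, the following minimal NumPy sketch instantiates one possible form of the learning problem described above: a single shared filter applied to K nonoverlapping Gaussian patches with ReLU activation and average pooling, noisy outputs generated by an unknown ground-truth filter, and a heavy-ball (momentum-accelerated) GD iteration started from a point near the ground truth in place of the tensor initialization. All dimensions, step sizes, and the perturbed initialization are illustrative assumptions rather than the paper's exact model or algorithm.

```python
# Minimal sketch (assumptions, not the paper's exact model or initialization):
# one-hidden-layer nonoverlapping CNN with a single shared filter w, ReLU
# activation, average pooling over K patches, and heavy-ball accelerated GD
# on the nonconvex squared loss with noisy training outputs.
import numpy as np

rng = np.random.default_rng(0)

K, d = 4, 8            # K nonoverlapping patches, each of dimension d
n = 2000               # number of training samples (finite-sample setting)
sigma = 0.01           # additive output-noise level

w_star = rng.normal(size=d)                    # unknown ground-truth filter
X = rng.normal(size=(n, K, d))                 # Gaussian input patches
y = np.maximum(X @ w_star, 0.0).mean(axis=1)   # f(x; w*) = (1/K) sum_k ReLU(w*^T x_k)
y += sigma * rng.normal(size=n)                # noisy training outputs

def loss_and_grad(w):
    """Squared loss (1/2n) sum_i (f(x_i; w) - y_i)^2 and its (sub)gradient."""
    pre = X @ w                                # (n, K) pre-activations
    resid = np.maximum(pre, 0.0).mean(axis=1) - y
    # ReLU subgradient: indicator of positive pre-activation
    grad = np.einsum('i,ik,ikd->d', resid, (pre > 0).astype(float), X) / (n * K)
    return 0.5 * np.mean(resid ** 2), grad

# Heavy-ball accelerated GD from a perturbed initialization
# (standing in for the tensor initialization method).
w = w_star + 0.3 * rng.normal(size=d)
w_prev = w.copy()
eta, beta = 0.2, 0.9                           # illustrative step size and momentum
for t in range(500):
    _, g = loss_and_grad(w)
    w, w_prev = w - eta * g + beta * (w - w_prev), w
    if t % 100 == 0:
        print(f"iter {t:3d}  ||w - w*|| = {np.linalg.norm(w - w_star):.3e}")
```

Under this toy setup, the iterates contract toward the ground-truth filter up to an error floor determined by the noise level sigma, mirroring the linear-convergence-up-to-noise behavior stated in the abstract.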