One-Pass Stochastic Gradient Descent in Overparametrized Two-Layer Neural Networks

There has been a recent surge of interest in understanding the convergence of gradient descent (GD) and stochastic gradient descent (SGD) in overparameterized neural networks. Most previous work assumes that the training data are given a priori as a batch, while less attention has been paid to the important setting where the training data arrive in a stream. In this paper, we study the streaming-data setup and show that, with overparameterization and random initialization, the prediction error of two-layer neural networks trained by one-pass SGD converges in expectation. The convergence rate depends on the eigen-decomposition of the integral operator associated with the neural tangent kernel (NTK). A key step in our analysis is to show that the random kernel function induced by the randomly initialized finite-width network converges to the NTK with high probability, using a VC-dimension argument and McDiarmid's inequality.
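
To make the streaming setup concrete, the following is a minimal NumPy sketch of one-pass SGD on an overparameterized two-layer ReLU network with random initialization. The width m, the step size, the 1/sqrt(m) output scaling, the synthetic data stream, and the choice to train only the hidden-layer weights are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(0)

d, m = 10, 4096     # input dimension; width m >> d (overparameterization)
eta = 0.5           # constant step size (the paper's schedule may differ)
T = 20000           # number of streaming samples, each used exactly once

# Random initialization: Gaussian hidden weights, fixed +/-1 output weights.
W = rng.standard_normal((m, d))
a = rng.choice([-1.0, 1.0], size=m)

def predict(W, x):
    """Two-layer ReLU network with 1/sqrt(m) output scaling."""
    return a @ np.maximum(W @ x, 0.0) / np.sqrt(m)

def sample():
    """Stream one fresh unit-norm input with a noiseless target (toy model)."""
    x = rng.standard_normal(d)
    x /= np.linalg.norm(x)
    y = np.sin(3.0 * x[0])   # stand-in for an unknown target function
    return x, y

for t in range(T):
    x, y = sample()                  # one pass: each sample is seen only once
    err = predict(W, x) - y          # prediction error on the incoming point
    # Gradient of the squared loss 0.5 * err**2 with respect to the hidden weights only.
    grad = np.outer(a * (W @ x > 0.0), x) * (err / np.sqrt(m))
    W -= eta * grad
```

In this regime the network's training dynamics are governed by the NTK. For unit-norm inputs and ReLU activations with only the hidden layer trained, the limiting kernel has the familiar closed form K(x, x') = x·x' (π − arccos(x·x')) / (2π); the eigen-decomposition of its associated integral operator is what drives the convergence rate mentioned in the abstract (if the output layer were also trained, the kernel would acquire an additional term).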
