On the Provable Generalization of Recurrent Neural Networks

Recurrent Neural Networks (RNNs) are a fundamental architecture in deep learning. Recently, several works have studied the training process of over-parameterized neural networks and shown that such networks can learn functions in some notable concept classes with a provable generalization error bound. In this paper, we analyze training and generalization for RNNs with random initialization and provide the following improvements over recent works: (1) For an RNN with input sequence $x = (X_1, X_2, \ldots, X_L)$, previous works study learning functions of the form $\sum_{l} f(\beta_l^\top X_l)$ and require the normalization condition $\|X_l\| \le \epsilon$ for some very small $\epsilon$ depending on the complexity of $f$. In this paper, using a detailed analysis of the neural tangent kernel matrix, we prove a generalization error bound for learning such functions without the normalization condition, and show that some notable concept classes are learnable with the number of iterations and samples scaling almost-polynomially in the input length $L$. (2) Moreover, we prove a novel result on learning $N$-variable functions of the input sequence of the form $f(\beta^\top [X_{l_1}, \ldots, X_{l_N}])$, which do not belong to the "additive" concept class, i.e., summations of functions of individual positions $X_l$. We show that when either $N$ or $l_0 = \max(l_1, \ldots, l_N) - \min(l_1, \ldots, l_N)$ is small, $f(\beta^\top [X_{l_1}, \ldots, X_{l_N}])$ is learnable with the number of iterations and samples scaling almost-polynomially in the input length $L$.
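
To make the two concept classes concrete, the sketch below constructs a toy target of each form for a random input sequence. This is a minimal illustration, not code from the paper: the function names, the choice of $f = \tanh$, and all dimensions are assumptions made for the example.

```python
# Minimal sketch (assumptions: f = tanh, L = 10, d = 8, random Gaussian inputs).
import numpy as np

rng = np.random.default_rng(0)
L, d = 10, 8                       # input length and per-position dimension (assumed)
X = rng.normal(size=(L, d))        # input sequence x = (X_1, ..., X_L)

def additive_target(X, betas, f=np.tanh):
    """'Additive' concept class: sum over positions l of f(beta_l^T X_l)."""
    return sum(f(beta_l @ X_l) for beta_l, X_l in zip(betas, X))

def n_variable_target(X, beta, positions, f=np.tanh):
    """N-variable concept class: f(beta^T [X_{l_1}, ..., X_{l_N}]),
    i.e. f applied to a linear functional of N concatenated positions."""
    concat = np.concatenate([X[l] for l in positions])
    return f(beta @ concat)

betas = rng.normal(size=(L, d))                 # one beta_l per position
y_additive = additive_target(X, betas)

positions = [2, 3, 5]                           # l_1, ..., l_N; here N = 3 and l_0 = 3
beta = rng.normal(size=(len(positions) * d,))
y_nvar = n_variable_target(X, beta, positions)
```

In this toy instance, either the number of selected positions $N$ or their span $l_0$ being small corresponds to the regime in which the paper's second result gives almost-polynomial (in $L$) iteration and sample complexity.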
