On the Provable Generalization of Recurrent Neural Networks

Recurrent Neural Networks (RNNs) are a fundamental architecture in deep learning. Recently, several works have studied the training process of over-parameterized neural networks and shown that such networks can learn functions in some notable concept classes with a provable generalization error bound. In this paper, we analyze training and generalization for RNNs with random initialization and provide the following improvements over recent works: (1) For an RNN with input sequence $x = (X_1, X_2, \ldots, X_L)$, previous works study learning functions of the form $\sum_{l} f(\beta_l^\top X_l)$ and require the normalization condition $\|X_l\| \le \epsilon$ for some very small $\epsilon$ depending on the complexity of $f$. In this paper, using a detailed analysis of the neural tangent kernel matrix, we prove a generalization error bound for learning such functions without the normalization condition, and show that some notable concept classes are learnable with the number of iterations and samples scaling almost-polynomially in the input length $L$. (2) Moreover, we prove a novel result on learning $N$-variable functions of the input sequence of the form $f(\beta^\top [X_{l_1}, \ldots, X_{l_N}])$, which do not belong to the "additive" concept class, i.e., summations of functions of individual positions $X_l$. We show that when either $N$ or $l_0 = \max(l_1, \ldots, l_N) - \min(l_1, \ldots, l_N)$ is small, $f(\beta^\top [X_{l_1}, \ldots, X_{l_N}])$ is learnable with the number of iterations and samples scaling almost-polynomially in the input length $L$.
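
To make the two concept classes concrete, the sketch below constructs a toy target of each form for a random input sequence. This is a minimal illustration, not code from the paper: the function names, the choice of $f = \tanh$, and all dimensions are assumptions made for the example.

```python
# Minimal sketch (assumptions: f = tanh, L = 10, d = 8, random Gaussian inputs).
import numpy as np

rng = np.random.default_rng(0)
L, d = 10, 8                       # input length and per-position dimension (assumed)
X = rng.normal(size=(L, d))        # input sequence x = (X_1, ..., X_L)

def additive_target(X, betas, f=np.tanh):
    """'Additive' concept class: sum over positions l of f(beta_l^T X_l)."""
    return sum(f(beta_l @ X_l) for beta_l, X_l in zip(betas, X))

def n_variable_target(X, beta, positions, f=np.tanh):
    """N-variable concept class: f(beta^T [X_{l_1}, ..., X_{l_N}]),
    i.e. f applied to a linear functional of N concatenated positions."""
    concat = np.concatenate([X[l] for l in positions])
    return f(beta @ concat)

betas = rng.normal(size=(L, d))                 # one beta_l per position
y_additive = additive_target(X, betas)

positions = [2, 3, 5]                           # l_1, ..., l_N; here N = 3 and l_0 = 3
beta = rng.normal(size=(len(positions) * d,))
y_nvar = n_variable_target(X, beta, positions)
```

In this toy instance, either the number of selected positions $N$ or their span $l_0$ being small corresponds to the regime in which the paper's second result gives almost-polynomial (in $L$) iteration and sample complexity.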
