Generalized Leverage Score Sampling for Neural Networks

Leverage score sampling is a powerful technique originating in theoretical computer science, and it can be used to speed up algorithms for a number of fundamental problems, e.g. linear regression, linear programming, semidefinite programming, cutting plane methods, graph sparsification, maximum matching, and max-flow. Recently, leverage score sampling has been shown to accelerate kernel methods [Avron, Kapralov, Musco, Musco, Velingker and Zandieh 17]. In this work, we generalize the results of [Avron, Kapralov, Musco, Musco, Velingker and Zandieh 17] to a broader class of kernels. We further bring leverage score sampling into the field of deep learning theory.

$\bullet$ We show the connection between the initialization of neural network training and approximating the neural tangent kernel with random features.

$\bullet$ We prove the equivalence between regularized neural networks and neural tangent kernel ridge regression under both classical random Gaussian initialization and leverage score sampling initialization.
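
To make the sampling idea concrete, below is a minimal NumPy sketch of ridge leverage score sampling applied to kernel ridge regression with random Fourier features. It is not the paper's construction: it assumes a Gaussian (RBF) kernel rather than the neural tangent kernel, and the helper names (`random_fourier_features`, `ridge_leverage_scores`, `sample_rows_by_leverage`), the bandwidth, and the regularization value are illustrative choices.

```python
# Sketch only: ridge leverage score sampling for kernel ridge regression
# with random Fourier features (RBF kernel assumed, parameters illustrative).
import numpy as np

def random_fourier_features(X, num_features, bandwidth, rng):
    """Map data X (n x d) to random Fourier features approximating the RBF kernel."""
    n, d = X.shape
    W = rng.normal(scale=1.0 / bandwidth, size=(d, num_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)
    return np.sqrt(2.0 / num_features) * np.cos(X @ W + b)

def ridge_leverage_scores(Phi, lam):
    """Ridge leverage scores of the rows of Phi: diag(Phi (Phi^T Phi + lam I)^{-1} Phi^T)."""
    m = Phi.shape[1]
    G = Phi.T @ Phi + lam * np.eye(m)
    # Solve instead of forming an explicit inverse, for numerical stability.
    return np.einsum("ij,ij->i", Phi, np.linalg.solve(G, Phi.T).T)

def sample_rows_by_leverage(Phi, y, lam, num_samples, rng):
    """Sample and reweight rows proportionally to their ridge leverage scores."""
    scores = ridge_leverage_scores(Phi, lam)
    probs = scores / scores.sum()
    idx = rng.choice(len(probs), size=num_samples, replace=True, p=probs)
    # Rescale sampled rows so the sketched normal equations are unbiased.
    scale = 1.0 / np.sqrt(num_samples * probs[idx])
    return Phi[idx] * scale[:, None], y[idx] * scale

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 2000, 5
    X = rng.normal(size=(n, d))
    y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)

    lam = 1e-2
    Phi = random_fourier_features(X, num_features=512, bandwidth=1.0, rng=rng)

    # Baseline: kernel ridge regression in the random feature space, all n rows.
    w_full = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ y)

    # Leverage-score-sampled approximation using far fewer rows.
    Phi_s, y_s = sample_rows_by_leverage(Phi, y, lam, num_samples=400, rng=rng)
    w_sketch = np.linalg.solve(Phi_s.T @ Phi_s + lam * np.eye(Phi_s.shape[1]), Phi_s.T @ y_s)

    print("relative parameter error:", np.linalg.norm(w_full - w_sketch) / np.linalg.norm(w_full))
```

Sampling proportionally to ridge leverage scores concentrates the sketch on the rows that the regularized solution actually depends on, which is why far fewer than $n$ reweighted rows suffice in the sketched solve.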

[1] Martin J. Wainwright, et al. Divide and conquer kernel ridge regression: a distributed algorithm with minimax optimal rates, 2013, J. Mach. Learn. Res.

[2] Amit Daniely, et al. SGD Learns the Conjugate Kernel Class of the Network, 2017, NIPS.

[3] David P. Woodruff, et al. Low rank approximation and regression in input sparsity time, 2012, STOC '13.

[4] Xin Yang, et al. Quadratic Suffices for Over-parametrization via Matrix Chernoff Bound, 2019, arXiv.

[5] Christos Boutsidis, et al. Optimal CUR matrix decompositions, 2014, STOC.

[6] Ruosong Wang, et al. On Exact Computation with an Infinitely Wide Neural Net, 2019, NeurIPS.

[7] Omri Weinstein, et al. Training (Overparametrized) Neural Networks in Near-Linear Time, 2020, ITCS.

[8] David P. Woodruff, et al. Learning Two Layer Rectified Neural Networks in Polynomial Time, 2018, COLT.

[9] Francis R. Bach, et al. Sharp analysis of low-rank kernel matrix approximations, 2012, COLT.

[10] Pravin M. Vaidya, et al. A new algorithm for minimizing convex functions over convex sets, 1996, Math. Program.

[11] Barnabás Póczos, et al. Gradient Descent Provably Optimizes Over-parameterized Neural Networks, 2018, ICLR.

[12] Yin Tat Lee, et al. A Faster Cutting Plane Method and its Implications for Combinatorial and Convex Optimization, 2015, FOCS.

[13] David P. Woodruff, et al. Low Rank Approximation with Entrywise ℓ1-Norm Error, 2016, arXiv.

[14] Michael B. Cohen, et al. Input Sparsity Time Low-rank Approximation via Ridge Leverage Score Sampling, 2015, SODA.

[15] David P. Woodruff, et al. Is Input Sparsity Time Possible for Kernel Low-Rank Approximation?, 2017, NIPS.

[16] Aaron Sidford, et al. Faster energy maximization for faster maximum flow, 2019, STOC.

[17] Yuanzhi Li, et al. On the Convergence Rate of Training Recurrent Neural Networks, 2018, NeurIPS.

[18] Cameron Musco, et al. Recursive Sampling for the Nyström Method, 2016, NIPS.

[19] Inderjit S. Dhillon, et al. Recovery Guarantees for One-hidden-layer Neural Networks, 2017, ICML.

[20] Yuanzhi Li, et al. A Convergence Theory for Deep Learning via Over-Parameterization, 2018, ICML.

[21] Richard Peng, et al. ℓp Row Sampling by Lewis Weights, 2014, arXiv:1412.0588.

[22] Michael B. Cohen, et al. Ridge Leverage Scores for Low-Rank Approximation, 2015, arXiv.

[23] Xue Chen, et al. Fourier-Sparse Interpolation without a Frequency Gap, 2016, FOCS.

[24] Zhao Song, et al. A robust multi-dimensional sparse Fourier transform in the continuous setting, 2020, arXiv.

[25] David P. Woodruff, et al. Sublinear Time Low-Rank Approximation of Positive Semidefinite Matrices, 2017, FOCS.

[26] Ameya Velingker, et al. Scaling up Kernel Ridge Regression via Locality Sensitive Hashing, 2020, AISTATS.

[27] Ankur Moitra, et al. Algorithmic foundations for the diffraction limit, 2020, STOC.

[28] Yuandong Tian, et al. Gradient Descent Learns One-hidden-layer CNN: Don't be Afraid of Spurious Local Minima, 2017, ICML.

[29] Huy L. Nguyen, et al. OSNAP: Faster Numerical Linear Algebra Algorithms via Sparser Subspace Embeddings, 2012, FOCS.

[30] J. Lindenstrauss, et al. Approximation of zonoids by zonotopes, 1989.

[31] Yin Tat Lee, et al. Path Finding Methods for Linear Programming: Solving Linear Programs in Õ(√rank) Iterations and Faster Algorithms for Maximum Flow, 2014, FOCS.

[32] Yuandong Tian, et al. An Analytical Formula of Population Gradient for two-layered ReLU network and its Applications in Convergence and Critical Point Analysis, 2017, ICML.

[33] Yin Tat Lee, et al. A Faster Interior Point Method for Semidefinite Programming, 2020, FOCS.

[34] David P. Woodruff, et al. Relative Error Tensor Low Rank Approximation, 2017, Electron. Colloquium Comput. Complex.

[35] Arthur Jacot, et al. Neural tangent kernel: convergence and generalization in neural networks, 2018, NeurIPS.

[36] Liwei Wang, et al. Gradient Descent Finds Global Minima of Deep Neural Networks, 2018, ICML.

[37] Richard Peng, et al. ℓp Row Sampling by Lewis Weights, 2015, STOC.

[38] Zhao Song, et al. A Robust Sparse Fourier Transform in the Continuous Setting, 2015, FOCS.

[39] Michael W. Mahoney, et al. Fast Randomized Kernel Ridge Regression with Statistical Guarantees, 2015, NIPS.

[40] David P. Woodruff, et al. Fast approximation of matrix coherence and statistical leverage, 2011, ICML.

[41] Francis Bach, et al. A Note on Lazy Training in Supervised Differentiable Programming, 2018, arXiv.

[42] Zhao Song, et al. Breaking the n-Pass Barrier: A Streaming Algorithm for Maximum Weight Bipartite Matching, 2020.

[43] David P. Woodruff, et al. Faster Kernel Ridge Regression Using Sketching and Preconditioning, 2016, SIAM J. Matrix Anal. Appl.

[44] Ameya Velingker, et al. A universal sampling method for reconstructing signals with simple Fourier transforms, 2018, STOC.

[45] Nikhil Srivastava, et al. Graph sparsification by effective resistances, 2008, SIAM J. Comput.

[46] Yin Tat Lee, et al. An improved cutting plane method for convex optimization, convex-concave games, and its applications, 2020, STOC.

[47] Aleksander Madry, et al. Navigating Central Path with Electrical Flows: From Flows to Matchings, and Back, 2013, FOCS.

[48] Joel A. Tropp, et al. An Introduction to Matrix Concentration Inequalities, 2015, Found. Trends Mach. Learn.

[49] Ruosong Wang, et al. Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks, 2019, ICML.

[50] Aaron Schild, et al. An almost-linear time algorithm for uniform random spanning tree generation, 2017, STOC.

[51] Yin Tat Lee, et al. A near-optimal algorithm for approximating the John Ellipsoid, 2019, COLT.

[52] David P. Woodruff, et al. Low rank approximation with entrywise ℓ1-norm error, 2017, STOC.

[53] Richard Peng, et al. Bipartite Matching in Nearly-linear Time on Moderately Dense Graphs, 2020, FOCS.

[54] P. Massart, et al. Adaptive estimation of a quadratic functional by model selection, 2000.

[55] Yin Tat Lee, et al. Solving tall dense linear programs in nearly linear time, 2020, STOC.

[56] Richard Peng, et al. ℓp Row Sampling by Lewis Weights, 2014, arXiv.

[57] H. Chernoff. A Measure of Asymptotic Efficiency for Tests of a Hypothesis Based on the Sum of Observations, 1952.

[58] Daniel A. Spielman, et al. Faster approximate lossy generalized flow via interior point algorithms, 2008, STOC.

[59] W. Hoeffding. Probability Inequalities for Sums of Bounded Random Variables, 1963.

[60] Mahdi Soltanolkotabi, et al. Learning ReLUs via Gradient Descent, 2017, NIPS.

[61] Tengyu Ma, et al. Learning One-hidden-layer Neural Networks with Landscape Design, 2017, ICLR.

[62] Francis Bach, et al. On Lazy Training in Differentiable Programming, 2018, NeurIPS.

[63] Aaron Sidford, et al. Faster Divergence Maximization for Faster Maximum Flow, 2020, arXiv.

[64] D. R. Lewis. Finite dimensional subspaces of $L_{p}$, 1978.

[65] Zhao Song, et al. Algorithms and Hardness for Linear Algebra on Geometric Graphs, 2020, FOCS.

[66] Yuanzhi Li, et al. Convergence Analysis of Two-layer Neural Networks with ReLU Activation, 2017, NIPS.

[67] David P. Woodruff, et al. Low-Rank PSD Approximation in Input-Sparsity Time, 2017, SODA.

[68] Yuanzhi Li, et al. Learning Overparameterized Neural Networks via Stochastic Gradient Descent on Structured Data, 2018, NeurIPS.

[69] Zhu Li, et al. Towards a Unified Analysis of Random Fourier Features, 2018, ICML.

[70] Amir Globerson, et al. Globally Optimal Gradient Descent for a ConvNet with Gaussian Inputs, 2017, ICML.

[71] Aleksander Madry, et al. Computing Maximum Flow with Augmenting Electrical Flows, 2016, FOCS.

[72] Inderjit S. Dhillon, et al. Learning Non-overlapping Convolutional Neural Networks with Multiple Kernels, 2017, arXiv.

[73] Ameya Velingker, et al. Random Fourier Features for Kernel Ridge Regression: Approximation Bounds and Statistical Guarantees, 2018, ICML.

[74] Yoram Singer, et al. Toward Deeper Understanding of Neural Networks: The Power of Initialization and a Dual View on Expressivity, 2016, NIPS.

[75] Benjamin Recht, et al. Random Features for Large-Scale Kernel Machines, 2007, NIPS.

[76] Omri Weinstein, et al. Faster Dynamic Matrix Inverse for Faster LPs, 2020, arXiv.