Generalized Leverage Score Sampling for Neural Networks

Leverage score sampling is a powerful technique originating in theoretical computer science, and it can be used to speed up algorithms for a number of fundamental problems, e.g. linear regression, linear programming, semidefinite programming, cutting plane methods, graph sparsification, maximum matching, and max-flow. Recently, leverage score sampling has been shown to accelerate kernel methods [Avron, Kapralov, Musco, Musco, Velingker and Zandieh 17]. In this work, we generalize the results of [Avron, Kapralov, Musco, Musco, Velingker and Zandieh 17] to a broader class of kernels. We further bring leverage score sampling into the field of deep learning theory.

$\bullet$ We show the connection between the initialization of neural network training and approximating the neural tangent kernel with random features.

$\bullet$ We prove the equivalence between regularized neural networks and neural tangent kernel ridge regression under both classical random Gaussian initialization and leverage score sampling initialization.
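
To make the sampling idea concrete, below is a minimal NumPy sketch of ridge leverage score sampling applied to kernel ridge regression with random Fourier features. It is not the paper's construction: it assumes a Gaussian (RBF) kernel rather than the neural tangent kernel, and the helper names (`random_fourier_features`, `ridge_leverage_scores`, `sample_rows_by_leverage`), the bandwidth, and the regularization value are illustrative choices.

```python
# Sketch only: ridge leverage score sampling for kernel ridge regression
# with random Fourier features (RBF kernel assumed, parameters illustrative).
import numpy as np

def random_fourier_features(X, num_features, bandwidth, rng):
    """Map data X (n x d) to random Fourier features approximating the RBF kernel."""
    n, d = X.shape
    W = rng.normal(scale=1.0 / bandwidth, size=(d, num_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)
    return np.sqrt(2.0 / num_features) * np.cos(X @ W + b)

def ridge_leverage_scores(Phi, lam):
    """Ridge leverage scores of the rows of Phi: diag(Phi (Phi^T Phi + lam I)^{-1} Phi^T)."""
    m = Phi.shape[1]
    G = Phi.T @ Phi + lam * np.eye(m)
    # Solve instead of forming an explicit inverse, for numerical stability.
    return np.einsum("ij,ij->i", Phi, np.linalg.solve(G, Phi.T).T)

def sample_rows_by_leverage(Phi, y, lam, num_samples, rng):
    """Sample and reweight rows proportionally to their ridge leverage scores."""
    scores = ridge_leverage_scores(Phi, lam)
    probs = scores / scores.sum()
    idx = rng.choice(len(probs), size=num_samples, replace=True, p=probs)
    # Rescale sampled rows so the sketched normal equations are unbiased.
    scale = 1.0 / np.sqrt(num_samples * probs[idx])
    return Phi[idx] * scale[:, None], y[idx] * scale

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 2000, 5
    X = rng.normal(size=(n, d))
    y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)

    lam = 1e-2
    Phi = random_fourier_features(X, num_features=512, bandwidth=1.0, rng=rng)

    # Baseline: kernel ridge regression in the random feature space, all n rows.
    w_full = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ y)

    # Leverage-score-sampled approximation using far fewer rows.
    Phi_s, y_s = sample_rows_by_leverage(Phi, y, lam, num_samples=400, rng=rng)
    w_sketch = np.linalg.solve(Phi_s.T @ Phi_s + lam * np.eye(Phi_s.shape[1]), Phi_s.T @ y_s)

    print("relative parameter error:", np.linalg.norm(w_full - w_sketch) / np.linalg.norm(w_full))
```

Sampling proportionally to ridge leverage scores concentrates the sketch on the rows that the regularized solution actually depends on, which is why far fewer than $n$ reweighted rows suffice in the sketched solve.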

[1] Martin J. Wainwright, et al. Divide and conquer kernel ridge regression: a distributed algorithm with minimax optimal rates, 2013, J. Mach. Learn. Res.

[2] Amit Daniely, et al. SGD Learns the Conjugate Kernel Class of the Network, 2017, NIPS.

[3] David P. Woodruff, et al. Low rank approximation and regression in input sparsity time, 2012, STOC '13.

[4] Xin Yang, et al. Quadratic Suffices for Over-parametrization via Matrix Chernoff Bound, 2019, arXiv.

[5] Christos Boutsidis, et al. Optimal CUR matrix decompositions, 2014, STOC.

[6] Ruosong Wang, et al. On Exact Computation with an Infinitely Wide Neural Net, 2019, NeurIPS.

[7] Omri Weinstein, et al. Training (Overparametrized) Neural Networks in Near-Linear Time, 2020, ITCS.

[8] David P. Woodruff, et al. Learning Two Layer Rectified Neural Networks in Polynomial Time, 2018, COLT.

[9] Francis R. Bach, et al. Sharp analysis of low-rank kernel matrix approximations, 2012, COLT.

[10] Pravin M. Vaidya, et al. A new algorithm for minimizing convex functions over convex sets, 1996, Math. Program.

[11] Barnabás Póczos, et al. Gradient Descent Provably Optimizes Over-parameterized Neural Networks, 2018, ICLR.

[12] Yin Tat Lee, et al. A Faster Cutting Plane Method and its Implications for Combinatorial and Convex Optimization, 2015, FOCS.

[13] David P. Woodruff, et al. Low Rank Approximation with Entrywise ℓ1-Norm Error, 2016, arXiv.

[14] Michael B. Cohen, et al. Input Sparsity Time Low-rank Approximation via Ridge Leverage Score Sampling, 2015, SODA.

[15] David P. Woodruff, et al. Is Input Sparsity Time Possible for Kernel Low-Rank Approximation?, 2017, NIPS.

[16] Aaron Sidford, et al. Faster energy maximization for faster maximum flow, 2019, STOC.

[17] Yuanzhi Li, et al. On the Convergence Rate of Training Recurrent Neural Networks, 2018, NeurIPS.

[18] Cameron Musco, et al. Recursive Sampling for the Nyström Method, 2016, NIPS.

[19] Inderjit S. Dhillon, et al. Recovery Guarantees for One-hidden-layer Neural Networks, 2017, ICML.

[20] Yuanzhi Li, et al. A Convergence Theory for Deep Learning via Over-Parameterization, 2018, ICML.

[21] Richard Peng, et al. ℓp Row Sampling by Lewis Weights, 2014, arXiv:1412.0588.

[22] Michael B. Cohen, et al. Ridge Leverage Scores for Low-Rank Approximation, 2015, arXiv.

[23] Xue Chen, et al. Fourier-Sparse Interpolation without a Frequency Gap, 2016, FOCS.

[24] Zhao Song, et al. A robust multi-dimensional sparse Fourier transform in the continuous setting, 2020, arXiv.

[25] David P. Woodruff, et al. Sublinear Time Low-Rank Approximation of Positive Semidefinite Matrices, 2017, FOCS.

[26] Ameya Velingker, et al. Scaling up Kernel Ridge Regression via Locality Sensitive Hashing, 2020, AISTATS.

[27] Ankur Moitra, et al. Algorithmic foundations for the diffraction limit, 2020, STOC.

[28] Yuandong Tian, et al. Gradient Descent Learns One-hidden-layer CNN: Don't be Afraid of Spurious Local Minima, 2017, ICML.

[29] Huy L. Nguyen, et al. OSNAP: Faster Numerical Linear Algebra Algorithms via Sparser Subspace Embeddings, 2012, FOCS.

[30] J. Lindenstrauss, et al. Approximation of zonoids by zonotopes, 1989.

[31] Yin Tat Lee, et al. Path Finding Methods for Linear Programming: Solving Linear Programs in Õ(√rank) Iterations and Faster Algorithms for Maximum Flow, 2014, FOCS.

[32] Yuandong Tian, et al. An Analytical Formula of Population Gradient for two-layered ReLU network and its Applications in Convergence and Critical Point Analysis, 2017, ICML.

[33] Yin Tat Lee, et al. A Faster Interior Point Method for Semidefinite Programming, 2020, FOCS.

[34] David P. Woodruff, et al. Relative Error Tensor Low Rank Approximation, 2017, Electron. Colloquium Comput. Complex.

[35] Arthur Jacot, et al. Neural tangent kernel: convergence and generalization in neural networks, 2018, NeurIPS.

[36] Liwei Wang, et al. Gradient Descent Finds Global Minima of Deep Neural Networks, 2018, ICML.

[37] Richard Peng, et al. ℓp Row Sampling by Lewis Weights, 2015, STOC.

[38] Zhao Song, et al. A Robust Sparse Fourier Transform in the Continuous Setting, 2015, FOCS.

[39] Michael W. Mahoney, et al. Fast Randomized Kernel Ridge Regression with Statistical Guarantees, 2015, NIPS.

[40] David P. Woodruff, et al. Fast approximation of matrix coherence and statistical leverage, 2011, ICML.

[41] Francis Bach, et al. A Note on Lazy Training in Supervised Differentiable Programming, 2018, arXiv.

[42] Zhao Song, et al. Breaking the n-Pass Barrier: A Streaming Algorithm for Maximum Weight Bipartite Matching, 2020.

[43] David P. Woodruff, et al. Faster Kernel Ridge Regression Using Sketching and Preconditioning, 2016, SIAM J. Matrix Anal. Appl.

[44] Ameya Velingker, et al. A universal sampling method for reconstructing signals with simple Fourier transforms, 2018, STOC.

[45] Nikhil Srivastava, et al. Graph sparsification by effective resistances, 2008, SIAM J. Comput.

[46] Yin Tat Lee, et al. An improved cutting plane method for convex optimization, convex-concave games, and its applications, 2020, STOC.

[47] Aleksander Madry, et al. Navigating Central Path with Electrical Flows: From Flows to Matchings, and Back, 2013, FOCS.

[48] Joel A. Tropp, et al. An Introduction to Matrix Concentration Inequalities, 2015, Found. Trends Mach. Learn.

[49] Ruosong Wang, et al. Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks, 2019, ICML.

[50] Aaron Schild, et al. An almost-linear time algorithm for uniform random spanning tree generation, 2017, STOC.

[51] Yin Tat Lee, et al. A near-optimal algorithm for approximating the John Ellipsoid, 2019, COLT.

[52] David P. Woodruff, et al. Low rank approximation with entrywise ℓ1-norm error, 2017, STOC.

[53] Richard Peng, et al. Bipartite Matching in Nearly-linear Time on Moderately Dense Graphs, 2020, FOCS.

[54] P. Massart, et al. Adaptive estimation of a quadratic functional by model selection, 2000.

[55] Yin Tat Lee, et al. Solving tall dense linear programs in nearly linear time, 2020, STOC.

[56] Richard Peng, et al. ℓp Row Sampling by Lewis Weights, 2014, arXiv.

[57] H. Chernoff. A Measure of Asymptotic Efficiency for Tests of a Hypothesis Based on the Sum of Observations, 1952.

[58] Daniel A. Spielman, et al. Faster approximate lossy generalized flow via interior point algorithms, 2008, STOC.

[59] W. Hoeffding. Probability Inequalities for Sums of Bounded Random Variables, 1963.

[60] Mahdi Soltanolkotabi, et al. Learning ReLUs via Gradient Descent, 2017, NIPS.

[61] Tengyu Ma, et al. Learning One-hidden-layer Neural Networks with Landscape Design, 2017, ICLR.

[62] Francis Bach, et al. On Lazy Training in Differentiable Programming, 2018, NeurIPS.

[63] Aaron Sidford, et al. Faster Divergence Maximization for Faster Maximum Flow, 2020, arXiv.

[64] D. R. Lewis. Finite dimensional subspaces of $L_{p}$, 1978.

[65] Zhao Song, et al. Algorithms and Hardness for Linear Algebra on Geometric Graphs, 2020, FOCS.

[66] Yuanzhi Li, et al. Convergence Analysis of Two-layer Neural Networks with ReLU Activation, 2017, NIPS.

[67] David P. Woodruff, et al. Low-Rank PSD Approximation in Input-Sparsity Time, 2017, SODA.

[68] Yuanzhi Li, et al. Learning Overparameterized Neural Networks via Stochastic Gradient Descent on Structured Data, 2018, NeurIPS.

[69] Zhu Li, et al. Towards a Unified Analysis of Random Fourier Features, 2018, ICML.

[70] Amir Globerson, et al. Globally Optimal Gradient Descent for a ConvNet with Gaussian Inputs, 2017, ICML.

[71] Aleksander Madry, et al. Computing Maximum Flow with Augmenting Electrical Flows, 2016, FOCS.

[72] Inderjit S. Dhillon, et al. Learning Non-overlapping Convolutional Neural Networks with Multiple Kernels, 2017, arXiv.

[73] Ameya Velingker, et al. Random Fourier Features for Kernel Ridge Regression: Approximation Bounds and Statistical Guarantees, 2018, ICML.

[74] Yoram Singer, et al. Toward Deeper Understanding of Neural Networks: The Power of Initialization and a Dual View on Expressivity, 2016, NIPS.

[75] Benjamin Recht, et al. Random Features for Large-Scale Kernel Machines, 2007, NIPS.

[76] Omri Weinstein, et al. Faster Dynamic Matrix Inverse for Faster LPs, 2020, arXiv.