On the Nyström Method for Approximating a Gram Matrix for Improved Kernel-Based Learning

A problem for many kernel-based methods is that the amount of computation required to find the solution scales as $O(n^3)$, where $n$ is the number of training examples. We develop and analyze an algorithm to compute an easily interpretable low-rank approximation to an $n \times n$ Gram matrix $G$ such that computations of interest may be performed more rapidly. The approximation is of the form $\tilde{G}_k = C W_k^+ C^T$, where $C$ is a matrix consisting of a small number $c$ of columns of $G$ and $W_k$ is the best rank-$k$ approximation to $W$, the matrix formed by the intersection of those $c$ columns of $G$ with the corresponding $c$ rows of $G$. An important aspect of the algorithm is the probability distribution used to randomly sample the columns; we use a judiciously chosen, data-dependent nonuniform probability distribution. Let $\|\cdot\|_2$ and $\|\cdot\|_F$ denote the spectral norm and the Frobenius norm of a matrix, respectively, and let $G_k$ be the best rank-$k$ approximation to $G$. We prove that, by choosing $O(k/\epsilon^4)$ columns, $\|G - C W_k^+ C^T\|_\xi \le \|G - G_k\|_\xi + \epsilon \sum_{i=1}^{n} G_{ii}^2$, both in expectation and with high probability, for both $\xi = 2, F$, and for all $k$ with $0 \le k \le \mathrm{rank}(W)$. This approximation can be computed using $O(n)$ additional space and time, after making two passes over the data from external storage.
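To make the construction concrete, below is a minimal Python/NumPy sketch of the sampling-based approximation described above. The function name `nystrom_approximation`, and the specific choice of sampling probabilities $p_i = G_{ii}^2 / \sum_j G_{jj}^2$ with replacement followed by the usual $1/\sqrt{c\,p_i}$ rescaling, are illustrative assumptions consistent with the error bound's dependence on $\sum_i G_{ii}^2$; this is a sketch of the idea, not a definitive statement of the algorithm's implementation details.

```python
import numpy as np

def nystrom_approximation(G, c, k, rng=None):
    """Rank-k Nystrom-type approximation: G_tilde_k = C @ W_k^+ @ C.T.

    Minimal sketch (see lead-in): columns of G are sampled with
    replacement using the assumed data-dependent probabilities
    p_i = G_ii^2 / sum_j G_jj^2, and the sampled columns/rows are
    rescaled by 1 / sqrt(c * p_i).
    """
    rng = np.random.default_rng() if rng is None else rng
    n = G.shape[0]

    # Data-dependent, nonuniform sampling probabilities.
    diag_sq = np.diag(G) ** 2
    p = diag_sq / diag_sq.sum()

    # Sample c column indices with replacement according to p.
    idx = rng.choice(n, size=c, replace=True, p=p)

    # Scaled column matrix C (n x c) and scaled intersection matrix W (c x c).
    scale = 1.0 / np.sqrt(c * p[idx])
    C = G[:, idx] * scale                              # scale each sampled column
    W = G[np.ix_(idx, idx)] * scale * scale[:, None]   # scale both rows and columns

    # Best rank-k approximation W_k of W via SVD, then its pseudoinverse W_k^+.
    U, s, Vt = np.linalg.svd(W)
    k_eff = int(min(k, np.sum(s > 1e-12)))             # guard against near-zero singular values
    W_k_pinv = (Vt[:k_eff].T / s[:k_eff]) @ U[:, :k_eff].T

    # The Nystrom-type approximation G_tilde_k = C W_k^+ C^T.
    return C @ W_k_pinv @ C.T
```

In practice one would compare $\|G - \tilde{G}_k\|_F$ against $\|G - G_k\|_F$ (from a truncated eigendecomposition of $G$) to observe the additive error term. Note that this sketch materializes $G$ and $\tilde{G}_k$ explicitly, so it does not exhibit the $O(n)$ additional-space, two-pass behavior of the algorithm analyzed in the paper.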
