Improved Fixed-Rank Nyström Approximation via QR Decomposition: Practical and Theoretical Aspects

Abstract The Nyström method is a popular technique that uses a small number of landmark points to compute a fixed-rank approximation of the large kernel matrices that arise in machine learning problems. In practice, to ensure high-quality approximations, the number of landmark points is chosen to be greater than the target rank. However, for simplicity, the standard Nyström method uses a suboptimal procedure for rank reduction. In this paper, we examine the drawbacks of the standard Nyström method in terms of poor performance and lack of theoretical guarantees. To address these issues, we present an efficient modification for generating improved fixed-rank Nyström approximations. Theoretical analysis and numerical experiments demonstrate the advantages of the modified method over the standard Nyström method. Overall, the aim of this paper is to convince researchers to use the modified method: it has nearly identical computational complexity, is easy to code, is substantially more accurate in many cases, and is optimal in a sense that we make precise.
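
To make the two rank-reduction procedures concrete, below is a minimal NumPy sketch of both: the standard approach truncates the small landmark block W before inverting it, while the QR-based approach named in the title projects the landmark columns onto an orthonormal basis and truncates the inner matrix instead. This is an illustration of the idea described in the abstract, not the paper's reference implementation; the function name, the pseudo-inverse usage, and the small-eigenvalue clipping in the standard branch are our own choices.

```python
import numpy as np

def nystrom_fixed_rank(K, landmarks, r, improved=True):
    """Rank-r Nystrom approximation of the PSD matrix K from m > r landmarks.

    Returns U (n x r) with orthonormal columns and S (length r) such that
    K is approximated by U @ np.diag(S) @ U.T.
    """
    C = K[:, landmarks]                     # n x m landmark columns of K
    W = C[landmarks, :]                     # m x m block K[landmarks, landmarks]

    if improved:
        # QR-based rank reduction: K ~ C W^+ C^T = Q (R W^+ R^T) Q^T, and the
        # best rank-r approximation of this product is obtained by truncating
        # the small inner matrix M = R W^+ R^T.
        Q, R = np.linalg.qr(C)              # thin QR: Q is n x m, R is m x m
        M = R @ np.linalg.pinv(W) @ R.T
        vals, vecs = np.linalg.eigh((M + M.T) / 2)   # symmetrize for roundoff
        top = np.argsort(vals)[::-1][:r]
        return Q @ vecs[:, top], vals[top]

    # Standard rank reduction: replace W by its best rank-r approximation W_r
    # and use K ~ C W_r^+ C^T, which is simple but suboptimal for the product.
    vals, vecs = np.linalg.eigh((W + W.T) / 2)
    top = np.argsort(vals)[::-1][:r]
    L = C @ vecs[:, top] / np.sqrt(np.maximum(vals[top], 1e-12))  # clip tiny eigenvalues
    U, s, _ = np.linalg.svd(L, full_matrices=False)  # orthonormalize the factor
    return U, s ** 2
```

As a usage sketch, with a kernel matrix K on n points one might call `U, S = nystrom_fixed_rank(K, np.random.choice(n, m, replace=False), r)` with uniformly sampled landmarks (any of the sampling schemes in the literature would do). Note that the extra thin QR costs O(nm^2), the same order as forming the standard approximation, which matches the abstract's claim of nearly identical computational complexity.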
