Improved guarantees and a multiple-descent curve for Column Subset Selection and the Nyström method

The Column Subset Selection Problem (CSSP) and the Nyström method are among the leading tools for constructing small low-rank approximations of large datasets in machine learning and scientific computing. A fundamental question in this area is: how well can a data subset of size k compete with the best rank-k approximation? We develop techniques which exploit spectral properties of the data matrix to obtain improved approximation guarantees which go beyond the standard worst-case analysis. Our approach leads to significantly better bounds for datasets with known rates of singular value decay, e.g., polynomial or exponential decay. Our analysis also reveals an intriguing phenomenon: the approximation factor as a function of k may exhibit multiple peaks and valleys, which we call a multiple-descent curve. A lower bound we establish shows that this behavior is not an artifact of our analysis, but rather it is an inherent property of the CSSP and Nyström tasks. Finally, using the example of a radial basis function (RBF) kernel, we show that both our improved bounds and the multiple-descent curve can be observed on real datasets simply by varying the RBF parameter.
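To make the central quantity concrete, the sketch below computes the CSSP approximation factor referred to above: the Frobenius-norm error of projecting A onto the span of k selected columns, divided by the error of the best rank-k approximation from the truncated SVD. This is a minimal illustration only, assuming NumPy and a synthetic matrix with a decaying spectrum; it uses uniform random column selection rather than the determinantal-point-process-style selection analyzed in the paper.

```python
# Minimal sketch (not the paper's algorithm) of the CSSP approximation factor:
#   ||A - P_S A||_F^2 / ||A - A_k||_F^2,
# where P_S projects onto the span of the selected columns and A_k is the
# best rank-k approximation (truncated SVD).
import numpy as np

def cssp_approximation_factor(A, cols, k):
    """Approximation factor for the column subset `cols` against the rank-k SVD."""
    # Orthonormal basis for the span of the selected columns, then project A onto it.
    C = A[:, cols]
    Q, _ = np.linalg.qr(C)
    residual_subset = np.linalg.norm(A - Q @ (Q.T @ A), "fro") ** 2

    # Best rank-k error: sum of the squared singular values beyond the k-th.
    s = np.linalg.svd(A, compute_uv=False)
    residual_best = np.sum(s[k:] ** 2)
    return residual_subset / residual_best

# Synthetic data with (roughly) exponentially decaying singular values.
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 100)) @ np.diag(0.9 ** np.arange(100))
k = 10
cols = rng.choice(A.shape[1], size=k, replace=False)  # uniform random subset, for illustration
print(cssp_approximation_factor(A, cols, k))
```

Sweeping k in such an experiment (with a stronger selection rule than uniform sampling) is the kind of plot in which the multiple-descent behavior described in the abstract appears.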
