Randomized sketches for kernels: Fast and optimal non-parametric regression

Kernel ridge regression (KRR) is a standard method for performing non-parametric regression over reproducing kernel Hilbert spaces. Given $n$ samples, the time and space complexity of computing the KRR estimate scale as $\mathcal{O}(n^3)$ and $\mathcal{O}(n^2)$, respectively, which is prohibitive in many applications. We propose approximations of KRR based on $m$-dimensional randomized sketches of the kernel matrix, and study how small the projection dimension $m$ can be made while still preserving minimax optimality of the approximate KRR estimate. For various classes of randomized sketches, including those based on Gaussian and randomized Hadamard matrices, we prove that it suffices to choose the sketch dimension $m$ proportional to the statistical dimension (modulo logarithmic factors). Thus, we obtain fast and minimax-optimal approximations to the KRR estimate for non-parametric regression.
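To make the construction concrete, below is a minimal numerical sketch of the sketched-KRR idea, assuming a Gaussian sketch matrix and an RBF kernel (neither choice is fixed by the abstract); it is an illustration under those assumptions, not the paper's exact estimator. It restricts the KRR dual weights to the range of an $m \times n$ sketch $S$, so that only an $m \times m$ linear system is solved in place of the $n \times n$ one; for clarity it still forms the full kernel matrix.

```python
import numpy as np

def rbf_kernel(X1, X2, bandwidth=1.0):
    """RBF (Gaussian) kernel matrix between the rows of X1 and X2."""
    sq = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-sq / (2 * bandwidth**2))

def sketched_krr(X, y, lam, m, seed=None):
    """Approximate KRR via an m-dimensional Gaussian sketch of the kernel matrix.

    Restricting the KRR dual weights to alpha = S^T theta and minimizing
        (1/n) ||y - K S^T theta||^2 + lam * theta^T S K S^T theta
    over theta in R^m yields an m x m linear system (K is symmetric, so
    (SK)(K S^T) = (SK)(SK)^T). Returns the induced dual weights alpha in R^n.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    K = rbf_kernel(X, X)                           # n x n kernel (Gram) matrix
    S = rng.standard_normal((m, n)) / np.sqrt(m)   # m x n Gaussian sketch
    SK = S @ K                                     # m x n sketched kernel
    A = SK @ SK.T / n + lam * (SK @ S.T)           # m x m system matrix
    b = SK @ y / n
    theta = np.linalg.solve(A, b)                  # m x m solve instead of n x n
    return S.T @ theta                             # dual weights alpha in R^n

def predict(alpha, X_train, X_test):
    """Evaluate the fitted function f(x) = sum_i alpha_i k(x, x_i)."""
    return rbf_kernel(X_test, X_train) @ alpha

# Toy usage: n = 2000 noisy samples of a smooth function, sketch size m = 60.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (2000, 1))
y = np.sin(4 * X[:, 0]) + 0.1 * rng.standard_normal(2000)
alpha = sketched_krr(X, y, lam=1e-3, m=60, seed=1)
X_test = np.linspace(-1, 1, 5)[:, None]
print(predict(alpha, X, X_test))
```

With $m$ on the order of the statistical dimension, the dominant costs in this dense-Gaussian variant are forming the sketched system, $\mathcal{O}(n^2 m)$, and the $\mathcal{O}(m^3)$ solve, in place of the $\mathcal{O}(n^3)$ exact KRR solve; structured sketches such as the randomized Hadamard transform reduce the sketching cost further.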
