FALKON: An Optimal Large Scale Kernel Method

Kernel methods provide a principled way to perform nonlinear, nonparametric learning. They rest on solid functional-analytic foundations and enjoy optimal statistical properties. However, at least in their basic form, they have limited applicability in large-scale scenarios because of stringent computational requirements in terms of time and especially memory. In this paper, we take a substantial step in scaling up kernel methods, proposing FALKON, a novel algorithm that can efficiently process millions of points. FALKON is derived by combining several algorithmic principles, namely stochastic subsampling, iterative solvers, and preconditioning. Our theoretical analysis shows that optimal statistical accuracy is achieved with essentially $O(n)$ memory and $O(n\sqrt{n})$ time. An extensive experimental analysis on large-scale datasets shows that, even on a single machine, FALKON outperforms previous state-of-the-art solutions, which exploit parallel/distributed architectures.
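To make the combination of principles concrete, the sketch below illustrates the general idea in Python: kernel ridge regression restricted to a uniform Nyström subsample of the data, solved with a conjugate gradient method preconditioned using only the small subsampled kernel block. This is a minimal, illustrative reconstruction, not the authors' implementation; the function names (`gaussian_kernel`, `falkon_sketch`) and parameters (`m` centers, regularization `lam`, bandwidth `sigma`, `iters`) are assumptions made for the example.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def gaussian_kernel(A, B, sigma=1.0):
    # Pairwise Gaussian (RBF) kernel matrix between rows of A and B.
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def falkon_sketch(X, y, m, lam, sigma=1.0, iters=20):
    """Nystrom kernel ridge regression solved by preconditioned conjugate
    gradient, in the spirit of FALKON (illustrative sketch only)."""
    n = X.shape[0]
    centers = X[np.random.choice(n, m, replace=False)]  # uniform Nystrom subsampling
    Kmm = gaussian_kernel(centers, centers, sigma)       # small m x m block
    Knm = gaussian_kernel(X, centers, sigma)             # tall n x m block

    # Preconditioner built from the m x m block only:
    # B ~ T^{-1} A^{-1}, with T'T = Kmm and A'A = T T'/m + lam I.
    T = cholesky(Kmm + 1e-10 * m * np.eye(m), lower=False)   # upper triangular
    A = cholesky(T @ T.T / m + lam * np.eye(m), lower=False)  # upper triangular

    def prec(v):    # apply B  = T^{-1} A^{-1}
        return solve_triangular(T, solve_triangular(A, v, lower=False), lower=False)

    def prec_t(v):  # apply B' = A^{-T} T^{-T}
        return solve_triangular(A.T, solve_triangular(T.T, v, lower=True), lower=True)

    def matvec(beta):
        # Preconditioned normal equations: B' (Knm' Knm + lam n Kmm) B beta.
        v = prec(beta)
        w = Knm.T @ (Knm @ v) + lam * n * (Kmm @ v)
        return prec_t(w)

    # Plain conjugate gradient on the (now well-conditioned) m x m system.
    b = prec_t(Knm.T @ y)
    beta = np.zeros(m)
    r = b - matvec(beta)
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Ap = matvec(p)
        step = rs / (p @ Ap)
        beta += step * p
        r -= step * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < 1e-8:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new

    alpha = prec(beta)  # coefficients on the Nystrom centers
    return centers, alpha

# Usage: predictions on X_test are gaussian_kernel(X_test, centers, sigma) @ alpha.
```

The key point the sketch conveys is that every step touches at most an n x m kernel block and m x m factorizations, and the solver runs for a small number of iterations, which is what keeps memory close to linear in n and time subquadratic.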
