Fastfood: Approximate Kernel Expansions in Loglinear Time

Despite their successes, kernel methods are difficult to use in many large-scale problems because computing the decision function is typically expensive, especially at prediction time. In this paper, we overcome this difficulty by proposing Fastfood, an approximation that significantly accelerates such computation. Key to Fastfood is the observation that Hadamard matrices, when combined with diagonal Gaussian matrices, exhibit properties similar to dense Gaussian random matrices. Yet unlike the latter, Hadamard and diagonal matrices are inexpensive to multiply and store. These two matrices can be used in lieu of Gaussian matrices in Random Kitchen Sinks (Rahimi & Recht, 2007), thereby speeding up the computation for a large range of kernel functions. Specifically, Fastfood requires O(n log d) time and O(n) storage to compute n non-linear basis functions in d dimensions, a significant improvement over O(nd) computation and storage, without sacrificing accuracy. We prove that the approximation is unbiased and has low variance. Extensive experiments show that we achieve similar accuracy to full kernel expansions and Random Kitchen Sinks while being 100x faster and using 1000x less memory. These improvements, especially in terms of memory usage, make kernel methods more practical for applications that have large training sets and/or require real-time prediction.
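To make the construction concrete, the following minimal Python/NumPy sketch (not the authors' implementation; the helpers fwht and make_fastfood are hypothetical names, and it assumes the Gaussian RBF kernel, an input dimension d that is a power of two, and a single block with n = d features) assembles one Fastfood block V = S H G Π H B / (σ√d) from a diagonal sign-flip matrix B, a fast Walsh-Hadamard transform H, a random permutation Π, a diagonal Gaussian matrix G, and a diagonal scaling S that matches the row lengths of a dense Gaussian matrix, then feeds Vx into the random cosine features of Random Kitchen Sinks.

import numpy as np

def fwht(x):
    # Unnormalized fast Walsh-Hadamard transform (computes H x) in O(d log d).
    # Assumes len(x) is a power of two.
    x = x.copy()
    d, h = len(x), 1
    while h < d:
        for i in range(0, d, 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x

def make_fastfood(d, sigma, rng):
    # Sample one Fastfood block V = S H G Pi H B / (sigma * sqrt(d)) and return the
    # induced random cosine feature map for the Gaussian RBF kernel of bandwidth sigma.
    B = rng.choice([-1.0, 1.0], d)        # diagonal of random sign flips
    G = rng.standard_normal(d)            # diagonal of i.i.d. Gaussian entries
    P = rng.permutation(d)                # random permutation Pi
    # S rescales each row so its length matches that of a row of a dense Gaussian matrix.
    S = np.sqrt(rng.chisquare(d, size=d)) / np.linalg.norm(G)
    b = rng.uniform(0.0, 2.0 * np.pi, d)  # random phases, as in Random Kitchen Sinks

    def phi(x):
        v = fwht(B * x)                   # H B x
        v = fwht(G * v[P])                # H G Pi H B x
        v = S * v / (sigma * np.sqrt(d))  # V x: approximately Gaussian projections / sigma
        return np.sqrt(2.0 / d) * np.cos(v + b)
    return phi

# The inner product of two feature vectors approximates exp(-||x - y||^2 / (2 sigma^2)).
rng = np.random.default_rng(0)
phi = make_fastfood(d=256, sigma=1.0, rng=rng)
x = rng.standard_normal(256)
y = x + 0.05 * rng.standard_normal(256)   # a nearby point, so the kernel value is non-negligible
print(phi(x) @ phi(y), np.exp(-0.5 * np.sum((x - y) ** 2)))

Because B, G, Π, and S are diagonal or permutation matrices, each block stores only O(d) numbers, and the two Walsh-Hadamard transforms dominate the cost at O(d log d) per block; stacking n/d independent blocks for n > d features gives the O(n) storage and O(n log d) time quoted above.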

[1] J. Mercer. Functions of Positive and Negative Type, and their Connection with the Theory of Integral Equations, 1909.

[2] N. Aronszajn. La théorie des noyaux reproduisants et ses applications, Première Partie. Mathematical Proceedings of the Cambridge Philosophical Society, 1943.

[3] H. Hochstadt. Special functions of mathematical physics, 1961.

[4] M. Aizerman et al. Theoretical Foundations of the Potential Function Method in Pattern Recognition Learning, 1964.

[5] G. Wahba et al. A Correspondence Between Bayesian Estimation on Stochastic Processes and Smoothing by Splines, 1970.

[6] E. Kreyszig. Introductory Functional Analysis With Applications, 1978.

[7] S. A. Orszag et al. CBMS-NSF Regional Conference Series in Applied Mathematics, 1978.

[8] C. Berg et al. Harmonic Analysis on Semigroups, 1984.

[9] C. Micchelli. Interpolation of scattered data: Distance matrices and conditionally positive definite functions, 1986.

[10] G. Wahba. Spline models for observational data, 1990.

[11] B. E. Boser et al. A training algorithm for optimal margin classifiers. COLT, 1992.

[12] T. A. Poggio et al. Regularization Theory and Neural Networks Architectures. Neural Computation, 1995.

[13] R. M. Neal. Priors for Infinite Networks, 1996.

[14] C. J. C. Burges et al. Simplified Support Vector Decision Rules. ICML, 1996.

[15] A. J. Smola et al. Support Vector Method for Function Approximation, Regression Estimation and Signal Processing. NIPS, 1996.

[16] M. Ledoux et al. Isoperimetry and Gaussian analysis, 1996.

[17] C. K. I. Williams. Prediction with Gaussian Processes: From Linear Regression to Linear Prediction and Beyond. Learning in Graphical Models, 1999.

[18] B. Schölkopf et al. Nonlinear Component Analysis as a Kernel Eigenvalue Problem. Neural Computation, 1998.

[19] B. Schölkopf et al. General cost functions for support vector regression, 1998.

[20] B. Schölkopf et al. The connection between regularization operators and support vector kernels. Neural Networks, 1998.

[21] F. Girosi. An Equivalence Between Sparse Approximation and Support Vector Machines. Neural Computation, 1998.

[22] A. J. Smola et al. Learning with kernels, 1998.

[23] D. Haussler. Convolution kernels on discrete structures, 1999.

[24] A. J. Smola et al. Regularization with Dot-Product Kernels. NIPS, 2000.

[25] B. Schölkopf et al. Sparse Greedy Matrix Approximation for Machine Learning. ICML, 2000.

[26] C. K. I. Williams et al. Using the Nyström Method to Speed Up Kernel Machines. NIPS, 2000.

[27] K. Scheinberg et al. Efficient SVM Training Using Low-Rank Kernel Representations. J. Mach. Learn. Res., 2002.

[28] K. Asai et al. Marginalized kernels for biological sequences. ISMB, 2002.

[29] A. van der Vaart et al. Lectures on probability theory and statistics, 2002.

[30] B. Taskar et al. Max-Margin Markov Networks. NIPS, 2003.

[31] A. W. Moore et al. Rapid Evaluation of Multiple Density Models. AISTATS, 2003.

[32] C. Cortes and V. Vapnik. Support-Vector Networks. Machine Learning, 1995.

[33] D. J. C. MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003.

[34] L. J. Guibas et al. Efficient Inference for Distributions on Permutations. NIPS, 2007.

[35] N. Ratliff et al. (Online) Subgradient Methods for Structured Prediction, 2007.

[36] A. Rahimi and B. Recht. Random Features for Large-Scale Kernel Machines. NIPS, 2007.

[37] A. Christmann et al. Support Vector Machines. Data Mining and Knowledge Discovery Handbook, 2008.

[38] C.-J. Lin et al. LIBLINEAR: A Library for Large Linear Classification. J. Mach. Learn. Res., 2008.

[39] A. Rahimi and B. Recht. Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning. NIPS, 2008.

[40] I. Kondor. Group theoretical methods in machine learning, 2008.

[41] A. G. Gray et al. Fast High-dimensional Kernel Summations Using the Monte Carlo Multipole Method. NIPS, 2008.

[42] B. Chazelle et al. The Fast Johnson-Lindenstrauss Transform and Approximate Nearest Neighbors. SIAM J. Comput., 2009.

[43] A. Krizhevsky. Learning Multiple Layers of Features from Tiny Images, 2009.

[44] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning, 2005.

[45] A. J. Smola et al. Bundle Methods for Regularized Risk Minimization. J. Mach. Learn. Res., 2010.

[46] A. Dasgupta et al. Fast locality-sensitive hashing. KDD, 2011.

[47] J. A. Tropp. Improved Analysis of the Subsampled Randomized Hadamard Transform. Adv. Data Sci. Adapt. Anal., 2010.

[48] S. P. Boyd et al. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. Found. Trends Mach. Learn., 2011.

[49] R. Jin et al. Improved Bound for the Nystrom's Method and its Application to Kernel Classification. arXiv, 2011.

[50] A. Das et al. Submodular meets Spectral: Greedy Algorithms for Subset Selection, Sparse Approximation and Dictionary Selection. ICML, 2011.

[51] I. Bogaert et al. O(1) Computation of Legendre Polynomials and Gauss-Legendre Nodes and Weights for Parallel Computing. SIAM J. Sci. Comput., 2012.

[52] A. J. Smola et al. Linear support vector machines via dual cached loops. KDD, 2012.

[53] R. Jin et al. Improved Bounds for the Nyström Method With Application to Kernel Classification. IEEE Transactions on Information Theory, 2011.

[54] Z. Ghahramani et al. The Random Forest Kernel and other kernels for big data from random partitions. arXiv, 2014.