Spherical Random Features for Polynomial Kernels

Compact explicit feature maps provide a practical framework to scale kernel methods to large-scale learning, but deriving such maps for many types of kernels remains a challenging open problem. Among the commonly used kernels for nonlinear classification are polynomial kernels, for which low approximation error has thus far necessitated explicit feature maps of large dimensionality, especially for higher-order polynomials. Meanwhile, because polynomial kernels are unbounded, they are frequently applied to data that has been normalized to unit ℓ2 norm. The question we address in this work is: if we know a priori that data is normalized, can we devise a more compact map? We show that a putative affirmative answer to this question based on Random Fourier Features is impossible in this setting, and introduce a new approximation paradigm, Spherical Random Fourier (SRF) features, which circumvents these issues and delivers a compact approximation to polynomial kernels for data on the unit sphere. Compared to prior work, SRF features are less rank-deficient, more compact, and achieve better kernel approximation, especially for higher-order polynomials. The resulting predictions have lower variance and typically yield better classification accuracy.
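
To make the setting concrete, the sketch below shows the standard Random Fourier Features map of Rahimi and Recht for the Gaussian kernel, together with a numerical check of the unit-sphere identity x·y = 1 − ||x − y||²/2, which is what turns a polynomial kernel on ℓ2-normalized data into a function of distance alone. This is an illustrative sketch under those assumptions, not the SRF construction itself; the function name rff_gaussian and its parameters are hypothetical choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def rff_gaussian(X, D=2048, sigma=1.0, rng=rng):
    """Standard Random Fourier Features (Rahimi & Recht) for the
    Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2)).
    Returns an (n, D) feature matrix Phi with Phi @ Phi.T ~ K."""
    d = X.shape[1]
    W = rng.normal(scale=1.0 / sigma, size=(d, D))   # frequencies from the kernel's spectral density
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)        # random phases
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

# Usage: the inner products of the random features approximate the Gram matrix.
X = rng.normal(size=(4, 5))
Phi = rff_gaussian(X, D=4096)
K_approx = Phi @ Phi.T
K_exact = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 2.0)

# On the unit sphere a polynomial kernel depends only on distance:
# for ||x|| = ||y|| = 1, x.y = 1 - ||x - y||^2 / 2, so
# (q + x.y)^p = (q + 1 - z^2 / 2)^p with z = ||x - y||.
x = rng.normal(size=5); x /= np.linalg.norm(x)
y = rng.normal(size=5); y /= np.linalg.norm(y)
q, p = 1.0, 3
z = np.linalg.norm(x - y)
assert np.isclose((q + x @ y) ** p, (q + 1 - z ** 2 / 2) ** p)
```

The identity in the final assertion is the observation that motivates a distance-based (and hence Fourier-style) treatment of polynomial kernels on ℓ2-normalized data; the paper's contribution is showing how to carry out such an approximation correctly on the sphere.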
