Not-So-Random Features

We propose a principled method for kernel learning, which relies on a Fourier-analytic characterization of translation-invariant or rotation-invariant kernels. Our method produces a sequence of feature maps, iteratively refining the SVM margin. We provide rigorous guarantees for optimality and generalization, interpreting our algorithm as online equilibrium-finding dynamics in a certain two-player min-max game. Evaluations on synthetic and real-world datasets demonstrate scalability and consistent improvements over related random-features-based methods.
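For orientation, the following is a minimal sketch of the standard random Fourier features baseline that the abstract compares against. It is not the kernel-learning algorithm proposed here. It assumes a Gaussian kernel with bandwidth `sigma`, whose Fourier measure is itself Gaussian, and all function names (`sample_frequencies`, `feature_map`, `approx_kernel`) are hypothetical and chosen only for illustration.

```python
import math
import random


def sample_frequencies(dim, num_features, sigma, rng):
    # Assumption: the kernel is k(x, y) = exp(-||x - y||^2 / (2 sigma^2)).
    # Its Fourier measure is N(0, I / sigma^2), so we draw frequencies from it.
    return [[rng.gauss(0.0, 1.0 / sigma) for _ in range(dim)]
            for _ in range(num_features)]


def feature_map(x, freqs):
    # Each frequency w contributes a cosine and a sine feature, so that
    # <phi(x), phi(y)> = (1/D) * sum_w cos(<w, x - y>).
    scale = 1.0 / math.sqrt(len(freqs))
    feats = []
    for w in freqs:
        t = sum(wi * xi for wi, xi in zip(w, x))
        feats.append(scale * math.cos(t))
        feats.append(scale * math.sin(t))
    return feats


def gaussian_kernel(x, y, sigma):
    # Exact kernel value, used only to check the approximation.
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def approx_kernel(x, y, freqs):
    # Inner product of the two explicit feature vectors.
    fx, fy = feature_map(x, freqs), feature_map(y, freqs)
    return sum(a * b for a, b in zip(fx, fy))


if __name__ == "__main__":
    rng = random.Random(0)
    freqs = sample_frequencies(dim=3, num_features=5000, sigma=1.0, rng=rng)
    x, y = [0.1, -0.4, 0.3], [0.5, 0.2, -0.1]
    print("approximate:", approx_kernel(x, y, freqs))
    print("exact:      ", gaussian_kernel(x, y, 1.0))
```

In this baseline the frequencies are drawn once from a fixed distribution. The abstract's method instead produces a sequence of feature maps that iteratively refine the SVM margin, which this sketch does not attempt to reproduce.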
