Scaling Up Kernel SVM on Limited Resources: A Low-Rank Linearization Approach

Kernel support vector machines (SVMs) deliver state-of-the-art results on many real-world nonlinear classification problems, but their computational cost is demanding because a large number of support vectors must be maintained. Linear SVMs, on the other hand, scale well to large data but are suited only to linearly separable problems. In this paper, we propose a novel approach, the low-rank linearized SVM, to scale up kernel SVMs on limited resources. Our approach transforms a nonlinear SVM into a linear one via an approximate empirical kernel map computed from efficient low-rank decompositions of the kernel matrix. We theoretically analyze the gap between the solutions of the approximate and optimal rank-$k$ kernel maps, which in turn provides guidance on the sampling scheme of the Nyström approximation. Furthermore, we extend the approach to a semisupervised metric learning scenario in which partially labeled samples can be exploited to further improve the quality of the low-rank embedding. Our approach inherits the rich representability of kernel SVMs and the high efficiency of linear SVMs. Experimental results demonstrate that our approach is more robust and achieves a better tradeoff between model representability and scalability than state-of-the-art algorithms for large-scale SVMs.
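To make the linearization idea concrete, the following is a minimal sketch, not the authors' implementation: it builds a rank-$k$ empirical kernel map from a Nyström approximation of an RBF kernel with uniformly sampled landmarks, then trains an off-the-shelf linear SVM on the mapped features. The helper names (`rbf`, `nystroem_map`) and all parameter choices are illustrative assumptions; only `numpy` and scikit-learn's `LinearSVC` are real library APIs.

```python
# Sketch of kernel-SVM linearization via a Nystroem empirical kernel map.
# Assumptions: RBF kernel, uniform landmark sampling, synthetic data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

def rbf(A, B, gamma=0.5):
    """RBF kernel matrix between rows of A and rows of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def nystroem_map(X, landmarks, gamma=0.5):
    """Rank-k empirical kernel map from a Nystroem approximation.

    With W = K(landmarks, landmarks) = V diag(vals) V^T and
    C = K(X, landmarks), the map Phi = C V diag(vals)^{-1/2}
    satisfies Phi Phi^T ~= C W^+ C^T ~= K.
    """
    W = rbf(landmarks, landmarks, gamma)        # m x m landmark kernel block
    C = rbf(X, landmarks, gamma)                # n x m cross-kernel block
    vals, vecs = np.linalg.eigh(W)              # eigendecompose W
    keep = vals > 1e-10                         # drop near-null directions
    proj = vecs[:, keep] / np.sqrt(vals[keep])  # V diag(vals)^{-1/2}
    return C @ proj                             # n x k linear features

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
landmarks = X[rng.choice(len(X), size=100, replace=False)]  # uniform sampling

Phi = nystroem_map(X, landmarks)
clf = LinearSVC().fit(Phi, y)                   # linear SVM in the mapped space
print("training accuracy:", clf.score(Phi, y))
```

The landmark-selection line is where the paper's theoretical guidance would plug in: replacing uniform sampling with a more informed Nyström sampling scheme changes only that step, while the downstream linear training is unchanged.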
