Data subset selection for efficient SVM training

Training a support vector machine (SVM) on large data sets is computationally intensive. In this paper, we study the problem of selecting a subset of the data for training an SVM classifier, under the requirement that the loss of performance due to the reduction of training data is low. A function quantifying the suitability of a selected subset is proposed, and a greedy algorithm for solving the subset selection problem is introduced. The algorithm is evaluated on handwritten digit recognition and other binary classification tasks, and its performance is compared to stratified sampling methods.
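
The abstract does not state the form of the suitability function or the details of the greedy procedure. As an illustration only, the sketch below assumes a facility-location objective over an RBF similarity matrix, a common submodular choice for data subset selection, and maximizes it greedily within each class. The function names (`rbf_similarity`, `greedy_facility_location`, `select_per_class`) and the `gamma` parameter are hypothetical and are not taken from the paper.

```python
import numpy as np


def rbf_similarity(X, gamma=1.0):
    """Pairwise RBF similarities exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
    return np.exp(-gamma * d2)


def greedy_facility_location(S, k):
    """Greedily pick k columns of S to maximize sum_i max_{j in A} S[i, j].

    This objective is an assumed stand-in for the paper's suitability function.
    """
    n = S.shape[0]
    best = np.zeros(n)                 # how well each point is covered so far
    taken = np.zeros(n, dtype=bool)    # candidates already selected
    selected = []
    for _ in range(k):
        # Marginal gain of candidate j: total improvement in coverage.
        gains = np.maximum(S - best[:, None], 0.0).sum(axis=0)
        gains[taken] = -np.inf
        j = int(np.argmax(gains))
        selected.append(j)
        taken[j] = True
        best = np.maximum(best, S[:, j])
    return selected


def select_per_class(X, y, k_per_class, gamma=1.0):
    """Run the greedy selection separately inside each class label."""
    chosen = []
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)
        S = rbf_similarity(X[idx], gamma)
        local = greedy_facility_location(S, min(k_per_class, len(idx)))
        chosen.extend(idx[local].tolist())
    return sorted(chosen)
```

The returned indices could then be used to train any SVM implementation on `X[chosen], y[chosen]` in place of the full data set. Selecting per class keeps every label represented, which also makes the result directly comparable to a stratified random sample of the same size.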
