On Coresets for Support Vector Machines

We present an efficient coreset construction algorithm for large-scale Support Vector Machine (SVM) training in Big Data and streaming applications. A coreset is a small, representative subset of the original data points such that models trained on the coreset are provably competitive with those trained on the original data set. Since the coreset is generally much smaller than the original set, our preprocess-then-train scheme has the potential to yield significant speedups when training SVM models. We prove lower and upper bounds on the size of the coreset required to obtain small data summaries for the SVM problem. As a corollary, we show that our algorithm can be used to extend the applicability of any off-the-shelf SVM solver to streaming, distributed, and dynamic data settings. We evaluate the performance of our algorithm on real-world and synthetic data sets. Our experimental results reaffirm the favorable theoretical properties of our algorithm and demonstrate its practical effectiveness in accelerating SVM training.
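The preprocess-then-train idea can be illustrated with a minimal sketch. This is not the construction from the paper: the importance score used here (a uniform term mixed with distance to the class mean) is a hypothetical stand-in for a properly derived sensitivity bound, and the names `build_coreset` and `train_weighted_svm` are made up for illustration. The sketch only shows the general pattern of sampling points non-uniformly, reweighting them by the inverse of their sampling probability so that the weighted loss is an unbiased estimate of the full loss, and then handing the weighted subset to an ordinary solver.

```python
import numpy as np

def build_coreset(X, y, m, rng):
    """Sample m weighted points; returns (indices, weights).

    Assumption: the importance score below is an illustrative heuristic,
    not the sensitivity bound analyzed in the paper.
    """
    n = X.shape[0]
    dist = np.empty(n)
    for label in np.unique(y):
        mask = y == label
        # Distance of each point to the mean of its own class.
        dist[mask] = np.linalg.norm(X[mask] - X[mask].mean(axis=0), axis=1)
    # Mix with a uniform term so every point has probability >= 0.5 / n.
    probs = 0.5 / n + 0.5 * dist / max(dist.sum(), 1e-12)
    probs /= probs.sum()
    idx = rng.choice(n, size=m, replace=True, p=probs)
    # Inverse-probability weights keep the weighted loss unbiased.
    weights = 1.0 / (m * probs[idx])
    return idx, weights

def train_weighted_svm(X, y, weights, lam=1e-2, epochs=200, lr=0.1):
    """Subgradient descent on the weighted, L2-regularized hinge loss (no bias term)."""
    w = np.zeros(X.shape[1])
    total = weights.sum()
    for t in range(1, epochs + 1):
        active = y * (X @ w) < 1  # points that violate the margin
        grad = lam * w - (weights[active] * y[active]) @ X[active] / total
        w -= lr / np.sqrt(t) * grad
    return w

# Synthetic two-class data: Gaussians centered at (+2, +2) and (-2, -2).
rng = np.random.default_rng(0)
n = 2000
y = rng.choice([-1.0, 1.0], size=n)
X = rng.normal(size=(n, 2)) + 2.0 * y[:, None]

idx, wts = build_coreset(X, y, m=100, rng=rng)
w_core = train_weighted_svm(X[idx], y[idx], wts)
w_full = train_weighted_svm(X, y, np.ones(n))

print("accuracy on full data, coreset model:", np.mean(np.sign(X @ w_core) == y))
print("accuracy on full data, full model:   ", np.mean(np.sign(X @ w_full) == y))
```

The solver here is a plain subgradient method chosen to keep the example dependency-free; any off-the-shelf SVM solver that accepts per-sample weights could take its place.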
