DiP-SVM: Distribution Preserving Kernel Support Vector Machine for Big Data

In the literature, support vector machines are typically trained on large datasets by splitting the data into manageably sized "partitions" and training a sequential support vector machine on each partition separately to obtain local support vectors. However, this process invariably leads to a loss of classification accuracy, because global support vectors may not be chosen as local support vectors in their respective partitions. We hypothesize that retaining the original distribution of the dataset in each partition can help solve this issue. Hence, we present DiP-SVM, a distribution preserving kernel support vector machine in which the first and second order statistics of the entire dataset are retained in each partition. This yields local decision boundaries that agree with the global decision boundary, thereby reducing the chance of missing important global support vectors. We show that, on several benchmark datasets, DiP-SVM incurs the smallest loss in classification accuracy among the distributed support vector machine techniques compared. We further demonstrate that our approach reduces communication overhead between partitions, leading to faster execution on large datasets and making it suitable for implementation in cloud environments.
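The idea of retaining first and second order statistics in each partition can be illustrated with a small check. The sketch below is not the DiP-SVM partitioning algorithm, which the abstract does not specify; it assumes a plain class-stratified random split as a stand-in and measures how far each partition's mean and covariance drift from those of the full dataset. The function names `stratified_partitions` and `distribution_gap` and the synthetic data are hypothetical, introduced only for this illustration.

```python
import numpy as np

def stratified_partitions(y, n_parts, seed=0):
    """Split sample indices into n_parts, keeping class proportions in each part.

    Assumption: a stratified random split stands in for a
    distribution preserving partitioner.
    """
    rng = np.random.default_rng(seed)
    parts = [[] for _ in range(n_parts)]
    for label in np.unique(y):
        # Shuffle the indices of this class, then deal them out evenly.
        idx = rng.permutation(np.flatnonzero(y == label))
        for p, chunk in enumerate(np.array_split(idx, n_parts)):
            parts[p].extend(chunk.tolist())
    return [np.array(sorted(p)) for p in parts]

def distribution_gap(X, part_idx):
    """Relative gap between a partition's and the full dataset's mean and covariance."""
    mu, cov = X.mean(axis=0), np.cov(X, rowvar=False)
    mu_p = X[part_idx].mean(axis=0)
    cov_p = np.cov(X[part_idx], rowvar=False)
    mean_gap = np.linalg.norm(mu_p - mu) / (np.linalg.norm(mu) + 1e-12)
    cov_gap = np.linalg.norm(cov_p - cov) / (np.linalg.norm(cov) + 1e-12)
    return mean_gap, cov_gap

if __name__ == "__main__":
    # Synthetic two-class data, used only to exercise the functions.
    rng = np.random.default_rng(42)
    X = np.vstack([rng.normal(2.0, 1.0, (600, 2)),
                   rng.normal(5.0, 1.0, (400, 2))])
    y = np.array([0] * 600 + [1] * 400)
    for i, part in enumerate(stratified_partitions(y, n_parts=4)):
        mean_gap, cov_gap = distribution_gap(X, part)
        print(f"partition {i}: mean gap {mean_gap:.4f}, cov gap {cov_gap:.4f}")
```

Small gaps indicate that a partition resembles the full dataset in its first two moments, which is the property the abstract argues keeps local decision boundaries close to the global one.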
