Distributed data augmented support vector machine on Spark

Support vector machines (SVMs) are widely used for classification in machine learning and data mining. Traditionally, however, they have been applied to small and medium-sized datasets. The recent need to scale with data size has drawn research attention to new SVM methods and implementations that operate at scale. Distributed SVMs are relatively new and have only recently been studied, and a distributed implementation of SVM with data augmentation has not yet been developed. This paper introduces a distributed data augmentation implementation of SVM on Apache Spark, a recent and popular platform for distributed computing that is widely used in both research and industry. We term our implementation the sparkling vector machine (SkVM); it supports both classification and regression tasks and scans through the data exactly once. In addition, we develop a framework for handling new classes that arrive in an online classification setting, where new data points can have labels that have not previously been seen, a problem we term label-drift classification. We demonstrate the scalability of the proposed method on large-scale datasets with more than one hundred million data points. The experimental results show that the predictive performance of our method is comparable to or better than that of the baselines, while it runs an order of magnitude faster.
