Fast and scalable support vector clustering for large-scale data analysis

As an important boundary-based clustering algorithm, support vector clustering (SVC) benefits many applications through its ability to handle arbitrary cluster shapes. However, its adoption is limited by both its expensive computation and its poor labeling performance, which are caused, respectively, by the redundant kernel matrix required to estimate the support function and by the inefficient checking of line segments between all pairs of data points. To address these two problems, this paper proposes a fast and scalable SVC (FSSVC) method that significantly improves efficiency while guaranteeing accuracy comparable to that of state-of-the-art methods. The heart of our approach includes (1) constructing the hypersphere and support function from cluster boundaries, which prunes unnecessary computation and storage of kernel functions, and (2) an adaptive labeling strategy that decomposes clusters into convex hulls and then employs either a convex-decomposition-based cluster labeling algorithm or a cone cluster labeling algorithm, depending on whether the radius of the hypersphere is greater than 1. Both theoretical analysis and experimental results (e.g., ranking first in a nonparametric statistical test) show the superiority of our method over the others, especially for large-scale data analysis under limited memory.
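
For readers unfamiliar with the pairwise segment check that the abstract identifies as a bottleneck, the following is a minimal sketch of the conventional SVC labeling test with a Gaussian kernel. It is not the FSSVC method. The names `q` (kernel width), `beta` (coefficients of the support function), `r2` (squared hypersphere radius), and `n_samples` are illustrative assumptions, and the uniform `beta` in the usage lines stands in for the coefficients that a real SVC would obtain by solving its dual optimization problem.

```python
import numpy as np

def gaussian_kernel(a, b, q):
    # Pairwise K(a_i, b_j) = exp(-q * ||a_i - b_j||^2) for rows of a and b.
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-q * d2)

def support_function(x, sv, beta, q):
    # Squared feature-space distance from each row of x to the sphere centre.
    # With a Gaussian kernel, K(x, x) = 1 for every x.
    k_xs = gaussian_kernel(x, sv, q)   # shape (n, m)
    k_ss = gaussian_kernel(sv, sv, q)  # shape (m, m)
    return 1.0 - 2.0 * (k_xs @ beta) + beta @ k_ss @ beta

def same_cluster(xi, xj, sv, beta, q, r2, n_samples=10):
    # Sample points on the segment xi -> xj; the pair is connected only if
    # every sample stays inside the hypersphere (small tolerance for rounding).
    t = np.linspace(0.0, 1.0, n_samples)[:, None]
    segment = (1.0 - t) * xi + t * xj
    return bool(np.all(support_function(segment, sv, beta, q) <= r2 + 1e-9))

# Toy usage: two well-separated pairs of points.
# ASSUMPTION: uniform beta is a placeholder, not a solved SVC dual.
points = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 0.0], [5.2, 0.0]])
beta = np.full(len(points), 1.0 / len(points))
q = 1.0
r2 = support_function(points, points, beta, q).max()

near = same_cluster(points[0], points[1], points, beta, q, r2)
far = same_cluster(points[0], points[2], points, beta, q, r2)
```

Running this check over all pairs of N points costs on the order of N^2 segment tests, each of which evaluates the kernel against every retained point, which illustrates why the abstract describes the labeling phase as inefficient.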
