Speeding-up the kernel k-means clustering method: A prototype based hybrid approach

The kernel k-means clustering method has proven effective at identifying non-isotropic and linearly inseparable clusters in the input space. However, it is not suitable for large datasets because its time complexity is quadratic in the size of the dataset. This paper presents a simple prototype based hybrid approach to speed up the kernel k-means clustering method for large datasets. The proposed method works in two stages. First, the dataset is partitioned into a number of small grouplets using the leaders clustering method, which takes the size of each grouplet, called the threshold t, as an input parameter. The conventional leaders clustering method is modified so that these grouplets are formed in the kernel induced feature space, but each grouplet is represented by a pattern (called its leader) in the input space. The dataset is re-indexed according to these grouplets. Second, the kernel k-means clustering method is applied to the set of leaders to derive a partition of the leaders set. Finally, each leader is replaced by its grouplet to obtain a partition of the entire dataset. Both the time complexity and the space complexity of the proposed method are O(n+p^2), where n is the size of the dataset and p is the number of leaders. The overall running time and the quality of the clustering result depend on the threshold t and on the order in which the dataset is scanned. This paper studies how the input parameter t affects the overall running time and the clustering quality obtained by the proposed method. Further, it is shown both theoretically and experimentally how the order in which the dataset is scanned affects the clustering result. The proposed method is also compared with other recent methods proposed to speed up the kernel k-means clustering method. Experiments on several real-world and synthetic datasets show that, for an appropriate value of t, the proposed method significantly reduces the computation time at the cost of a small loss in clustering quality, particularly for large datasets.
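
The two-stage procedure described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The Gaussian kernel, the function names (`rbf_kernel`, `kernel_leaders`, `kernel_kmeans`, `hybrid_kernel_kmeans`), the random initialisation, and the unweighted treatment of leaders are all assumptions made for illustration. The abstract does not specify these details, and the sketch makes no attempt to reproduce the stated O(n+p^2) cost.

```python
import numpy as np

def rbf_kernel(x, y, gamma=1.0):
    """Gaussian kernel (an assumed choice; any Mercer kernel could be used)."""
    d = x - y
    return np.exp(-gamma * np.dot(d, d))

def kernel_distance(x, y, kernel):
    """Distance between phi(x) and phi(y) in the kernel induced feature space."""
    sq = kernel(x, x) - 2.0 * kernel(x, y) + kernel(y, y)
    return np.sqrt(max(sq, 0.0))

def kernel_leaders(X, t, kernel):
    """Single-scan leaders clustering with feature-space distances.

    Returns the dataset indices of the leaders and, for every pattern,
    the position (in the leaders list) of the leader it follows.
    The result depends on the order in which X is scanned.
    """
    leaders = []
    assign = np.empty(len(X), dtype=int)
    for i, x in enumerate(X):
        for j, l in enumerate(leaders):
            if kernel_distance(x, X[l], kernel) <= t:
                assign[i] = j          # join the first leader within t
                break
        else:
            leaders.append(i)          # no leader close enough: x becomes one
            assign[i] = len(leaders) - 1
    return leaders, assign

def kernel_kmeans(K, k, n_iter=100, seed=0):
    """Plain (unweighted) kernel k-means on a precomputed p x p kernel matrix."""
    rng = np.random.default_rng(seed)
    p = K.shape[0]
    labels = rng.integers(0, k, size=p)   # random initial partition (assumption)
    for _ in range(n_iter):
        dist = np.full((p, k), np.inf)    # empty clusters stay at infinity
        for c in range(k):
            members = labels == c
            m = members.sum()
            if m == 0:
                continue
            # ||phi(x_i) - mu_c||^2 = K_ii - (2/m) sum_j K_ij + (1/m^2) sum_jl K_jl
            dist[:, c] = (np.diag(K)
                          - 2.0 * K[:, members].sum(axis=1) / m
                          + K[np.ix_(members, members)].sum() / m ** 2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

def hybrid_kernel_kmeans(X, k, t, kernel=rbf_kernel):
    """Stage 1: leaders in feature space. Stage 2: kernel k-means on the leaders."""
    leaders, assign = kernel_leaders(X, t, kernel)
    L = X[leaders]
    K = np.array([[kernel(a, b) for b in L] for a in L])   # p x p, not n x n
    leader_labels = kernel_kmeans(K, k)
    # Each leader is replaced by its grouplet: followers inherit the leader's label.
    return leader_labels[assign], len(leaders)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0.0, 0.1, (30, 2)), rng.normal(5.0, 0.1, (30, 2))])
    labels, p = hybrid_kernel_kmeans(X, k=2, t=0.5)
    print("leaders:", p, "of", len(X), "patterns")
```

In this sketch the kernel matrix is built only over the p leaders, which is where the saving over the n x n matrix of conventional kernel k-means comes from. A larger t yields fewer leaders and a smaller matrix, at the price of coarser grouplets.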

[1]  Francesco Masulli,et al.  A survey of kernel and spectral methods for clustering , 2008, Pattern Recognit..

[2]  P. Viswanath,et al.  Rough-DBSCAN: A fast hybrid density based clustering method for large data sets , 2009, Pattern Recognit. Lett..

[3]  Zhongdong Wu,et al.  Fuzzy C-means clustering algorithm based on kernel method , 2003, Proceedings Fifth International Conference on Computational Intelligence and Multimedia Applications. ICCIMA 2003.

[4]  Rong Zhang,et al.  A large scale clustering scheme for kernel K-Means , 2002, Object recognition supported by user interaction for service robots.

[5]  A. Newton,et al.  Sketched symbol recognition using Zernike moments , 2004, ICPR 2004.

[6]  P. Viswanath,et al.  Rough-fuzzy weighted k-nearest leader classifier for large data sets , 2009, Pattern Recognit..

[7]  Rohilah Sahak,et al.  Choice for a support vector machine kernel function for recognizing asphyxia from infant cries , 2009, 2009 IEEE Symposium on Industrial Electronics & Applications.

[8]  L. Hubert,et al.  Comparing partitions , 1985 .

[9]  Jing Lu,et al.  Semi-supervised fuzzy clustering: A kernel-based approach , 2009, Knowl. Based Syst..

[10]  Driss Aboutajdine,et al.  Comparison of Performance between Different SVM Kernels for the Identification of Adult Video , 2011 .

[11]  P. A. Vijaya,et al.  Leaders - Subleaders: An efficient hierarchical clustering algorithm for large data sets , 2004, Pattern Recognit. Lett..

[12]  William M. Rand,et al.  Objective Criteria for the Evaluation of Clustering Methods , 1971 .

[13]  J. Flusser,et al.  Moments and Moment Invariants in Pattern Recognition , 2009 .

[14]  M. Narasimha Murty,et al.  An incremental data mining algorithm for compact realization of prototypes , 2001, Pattern Recognit..

[15]  Doheon Lee,et al.  Evaluation of the performance of clustering algorithms in kernel-induced feature space , 2005, Pattern Recognit..

[16]  P. Viswanath,et al.  l-DBSCAN : A Fast Hybrid Density Based Clustering Method , 2006, 18th International Conference on Pattern Recognition (ICPR'06).

[17]  Nello Cristianini,et al.  An Introduction to Support Vector Machines and Other Kernel-based Learning Methods , 2000 .

[18]  Mark A. Girolami,et al.  Mercer kernel-based clustering in feature space , 2002, IEEE Trans. Neural Networks.

[19]  Aristidis Likas,et al.  The Global Kernel $k$-Means Algorithm for Clustering in Feature Space , 2009, IEEE Transactions on Neural Networks.

[20]  Gunnar Rätsch,et al.  An introduction to kernel-based learning algorithms , 2001, IEEE Trans. Neural Networks.

[21]  P. Viswanath,et al.  Speeding-Up the K-Means Clustering Method: A Prototype Based Approach , 2009, PReMI.

[22]  Rong Jin,et al.  Approximate kernel k-means: solution to large scale kernel clustering , 2011, KDD.

[23]  Seungjin Choi,et al.  Soft Geodesic Kernel K-Means , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[24]  Yoshua Bengio,et al.  Convergence Properties of the K-Means Algorithms , 1994, NIPS.

[25]  Bernhard Schölkopf,et al.  Nonlinear Component Analysis as a Kernel Eigenvalue Problem , 1998, Neural Computation.

[26]  D.M. Mount,et al.  An Efficient k-Means Clustering Algorithm: Analysis and Implementation , 2002, IEEE Trans. Pattern Anal. Mach. Intell..

[27]  Catherine Blake,et al.  UCI Repository of machine learning databases , 1998 .

[28]  T. Ravindra Babu,et al.  Comparison of genetic algorithm based prototype selection schemes , 2001, Pattern Recognit..

[29]  Aristidis Likas,et al.  The global kernel k-means clustering algorithm , 2008, 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence).

[30]  Sukumar Nandi,et al.  A distance based clustering method for arbitrary shaped clusters in large datasets , 2011, Pattern Recognit..

[31]  Inderjit S. Dhillon,et al.  Kernel k-means: spectral clustering and normalized cuts , 2004, KDD.

[32]  B. Eswara Reddy,et al.  A hybrid approach to speed-up the k-means clustering method , 2012, International Journal of Machine Learning and Cybernetics.

[33]  Robert F. Ling,et al.  Cluster analysis algorithms for data reduction and classification of objects , 1981 .

[34]  B. Eswara Reddy,et al.  A fast approximate kernel k-means clustering method for large data sets , 2011, 2011 IEEE Recent Advances in Intelligent Computational Systems.

[35]  I. Dhillon,et al.  A Unified View of Kernel k-means , Spectral Clustering and Graph Cuts , 2004 .