Two-level k-means clustering algorithm for k-tau relationship establishment and linear-time classification

Partitional clustering algorithms, which divide a dataset into a pre-defined number of clusters, fall broadly into two types: those that take the number of clusters explicitly as input, and those that take the expected size of a cluster as input. In this paper, we propose a variant of the k-means algorithm and prove that it is more efficient than the standard k-means algorithm. An important contribution of this paper is a relation between the number of clusters and the size of the clusters in a dataset, which we establish by analyzing our algorithm. We also demonstrate that using this algorithm as a pre-processing step in classification algorithms reduces their running-time complexity.
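To make the two kinds of input concrete, the sketch below runs a plain Lloyd-style k-means for a given number of clusters and then measures the largest point-to-centroid distance that results. It is an illustration only and not the two-level algorithm proposed in the paper: the function names `kmeans` and `max_radius`, the use of the maximum radius as a stand-in for "cluster size", and the reading of the tau in the title as that size parameter are all assumptions made for this example.

```python
import math
import random


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd-style k-means on a list of equal-length tuples.

    Illustrative baseline only; NOT the paper's two-level variant.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k distinct input points as seeds
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
    return centroids, labels


def max_radius(points, centroids, labels):
    """Largest point-to-centroid distance (assumed proxy for cluster size)."""
    return max(math.dist(p, centroids[lab]) for p, lab in zip(points, labels))


if __name__ == "__main__":
    rng = random.Random(1)
    # Three synthetic 2-D blobs (hypothetical data, for illustration only).
    data = [
        (cx + rng.gauss(0, 1), cy + rng.gauss(0, 1))
        for cx, cy in [(0, 0), (10, 0), (5, 8)]
        for _ in range(50)
    ]
    for k in (1, 2, 3, 6):
        cents, labs = kmeans(data, k)
        print(k, round(max_radius(data, cents, labs), 3))
```

Running the script prints, for each number of clusters, the cluster size that comes out of it. An algorithm of the second type would go the other way, fixing a target size and letting the number of clusters follow from it.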
