Cluster center initialization algorithm for K-means clustering

Performance of iterative clustering algorithms which converges to numerous local minima depend highly on initial cluster centers. Generally initial cluster centers are selected randomly. In this paper we propose an algorithm to compute initial cluster centers for K-means clustering. This algorithm is based on two observations that some of the patterns are very similar to each other and that is why they have same cluster membership irrespective to the choice of initial cluster centers. Also, an individual attribute may provide some information about initial cluster center. The initial cluster centers computed using this methodology are found to be very close to the desired cluster centers, for iterative clustering algorithms. This procedure is applicable to clustering algorithms for continuous data. We demonstrate the application of proposed algorithm to K-means clustering algorithm. The experimental results show improved and consistent solutions using the proposed algorithm.

[1]  Joshua Zhexue Huang,et al.  Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values , 1998, Data Mining and Knowledge Discovery.

[2]  C. A. Murthy,et al.  Density-Based Multiscale Data Condensation , 2002, IEEE Trans. Pattern Anal. Mach. Intell..

[3]  Keinosuke Fukunaga,et al.  Introduction to statistical pattern recognition (2nd ed.) , 1990 .

[4]  H. Ralambondrainy,et al.  A conceptual version of the K-means algorithm , 1995, Pattern Recognit. Lett..

[5]  Ian H. Witten,et al.  Data mining: practical machine learning tools and techniques with Java implementations , 2002, SGMD.

[6]  M. Thomason Interactive Pattern Recognition , 1981, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[7]  Gregory Piatetsky-Shapiro,et al.  Advances in Knowledge Discovery and Data Mining , 2004, Lecture Notes in Computer Science.

[8]  R. Fisher THE USE OF MULTIPLE MEASUREMENTS IN TAXONOMIC PROBLEMS , 1936 .

[9]  Allen Gersho,et al.  Vector quantization and signal compression , 1991, The Kluwer international series in engineering and computer science.

[10]  Rich Caruana,et al.  Meta Clustering , 2006, Sixth International Conference on Data Mining (ICDM'06).

[11]  Vipin Kumar,et al.  Partitioning-based clustering for Web document categorization , 1999, Decis. Support Syst..

[12]  Michael R. Anderberg,et al.  Cluster Analysis for Applications , 1973 .

[13]  Sudipto Guha,et al.  ROCK: a robust clustering algorithm for categorical attributes , 1999, Proceedings 15th International Conference on Data Engineering (Cat. No.99CB36337).

[14]  Dimitrios Gunopulos,et al.  Automatic subspace clustering of high dimensional data for data mining applications , 1998, SIGMOD '98.

[15]  Paul S. Bradley,et al.  Refining Initial Points for K-Means Clustering , 1998, ICML.

[16]  Bo Thiesson,et al.  Learning Mixtures of Bayesian Networks , 1997, UAI 1997.

[17]  Euripides G. M. Petrakis,et al.  Similarity Searching in Medical Image Databases , 1997, IEEE Trans. Knowl. Data Eng..

[18]  Zengyou He,et al.  Farthest-Point Heuristic based Initialization Methods for K-Modes Clustering , 2006, ArXiv.

[19]  D. A. Bell,et al.  Applied Statistics , 1953, Nature.

[20]  Anil K. Jain,et al.  Algorithms for Clustering Data , 1988 .

[21]  Zengyou He,et al.  A cluster ensemble method for clustering categorical data , 2005, Information Fusion.

[22]  Keinosuke Fukunaga,et al.  Introduction to Statistical Pattern Recognition , 1972 .

[23]  Paul S. Bradley,et al.  Clustering via Concave Minimization , 1996, NIPS.

[24]  Ana L. N. Fred,et al.  Data clustering using evidence accumulation , 2002, Object recognition supported by user interaction for service robots.

[25]  Yi-tzuu T. Chien Interactive pattern recognition , 1978 .

[26]  Andrew McCallum,et al.  Toward Optimal Active Learning through Sampling Estimation of Error Reduction , 2001, ICML.

[27]  Edwin Diday,et al.  Symbolic clustering using a new dissimilarity measure , 1991, Pattern Recognit..

[28]  Enrique H. Ruspini,et al.  Numerical methods for fuzzy clustering , 1970, Inf. Sci..

[29]  M. Crochemore,et al.  On-line construction of suffix trees , 2002 .

[30]  David D. Lewis,et al.  Heterogeneous Uncertainty Sampling for Supervised Learning , 1994, ICML.

[31]  Ian H. Witten,et al.  Data mining: practical machine learning tools and techniques, 3rd Edition , 1999 .

[32]  Ali S. Hadi,et al.  Finding Groups in Data: An Introduction to Chster Analysis , 1991 .

[33]  Richard O. Duda,et al.  Pattern classification and scene analysis , 1974, A Wiley-Interscience publication.

[34]  Joshua Zhexue Huang,et al.  A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining , 1997, DMKD.

[35]  J. MacQueen Some methods for classification and analysis of multivariate observations , 1967 .

[36]  Pedro Larrañaga,et al.  An empirical comparison of four initialization methods for the K-Means algorithm , 1999, Pattern Recognit. Lett..

[37]  Shehroz S. Khan,et al.  Computing Initial points using Density Based Multiscale Data Condensation for Clustering Categorical data , 2003 .

[38]  Jiri Matas,et al.  Spatial and Feature Space Clustering: Applications in Image Analysis , 1995, CAIP.

[39]  Zhengxin Chen,et al.  An iterative initial-points refinement algorithm for categorical data clustering , 2002, Pattern Recognit. Lett..