Initializing k-means Clustering by Bootstrap and Data Depth

The k-means algorithm is widely used across research fields because it converges quickly to a minimum of its cost function; however, it is sensitive to initial conditions and frequently gets stuck in poor local optima. This paper explores a simple, computationally feasible method that provides k-means with a set of initial seeds for clustering datasets of arbitrary dimension. Our technique consists of two stages: first, we obtain a set of prototypes (cluster centers) by applying k-means to bootstrap replications of the data; second, we cluster the space of these prototypes, whose groups are tighter (and thus easier to separate) than those of the original data, and select the deepest point of each assembled cluster, according to a notion of data depth, as a seed. We test this method on simulated and real data, compare it with commonly used k-means initialization algorithms, and show that it is feasible and, in many situations, more efficient than previous proposals.
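The two-stage scheme described above can be sketched in plain NumPy. This is a minimal illustration, not the authors' implementation: the helper names (`lloyd`, `bootstrap_depth_seeds`) and the parameter choices are ours, and the "deepest point" is approximated by the medoid of each prototype cluster (the point minimizing the sum of distances to the others), standing in for a formal depth measure such as the L1 depth used in the paper.

```python
import numpy as np

def lloyd(X, init, n_iter=50):
    """Plain Lloyd's k-means: returns final centers and last assignment."""
    centers = init.copy()
    for _ in range(n_iter):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(len(centers)):
            pts = X[labels == j]
            if len(pts):  # keep the old center if a cluster empties out
                centers[j] = pts.mean(0)
    return centers, labels

def bootstrap_depth_seeds(X, k, B=20, rng=None):
    """Two-stage seeding sketch: bootstrap prototypes, then deepest points."""
    rng = np.random.default_rng(rng)
    n = len(X)
    # Stage 1: k-means on B bootstrap replicates -> B*k prototypes.
    protos = []
    for _ in range(B):
        Xb = X[rng.integers(0, n, n)]              # bootstrap resample
        init = Xb[rng.choice(n, k, replace=False)]  # random seeds per replicate
        centers, _ = lloyd(Xb, init)
        protos.append(centers)
    P = np.vstack(protos)
    # Stage 2: cluster the prototype space (tighter groups than the raw data)
    # and return the "deepest" member of each group as a seed.
    init = P[rng.choice(len(P), k, replace=False)]
    _, labels = lloyd(P, init)
    seeds = []
    for j in range(k):
        C = P[labels == j] if np.any(labels == j) else P
        # Depth proxy: medoid = prototype with the smallest total distance
        # to the other prototypes in its group.
        d = np.linalg.norm(C[:, None] - C[None], axis=-1).sum(1)
        seeds.append(C[np.argmin(d)])
    return np.array(seeds)
```

The returned `seeds` array (shape `k × dim`) would then be passed to a final k-means run on the original data in place of random initialization.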
