I-k-means-+: An iterative clustering algorithm based on an enhanced version of the k-means

Abstract The k-means algorithm takes as its objective function the sum of the squared Euclidean distances from each point to the mean of its cluster (SSEDM), and tries to minimize it. Although the algorithm is effective, it is highly sensitive to the choice of initial centers, so many approaches in the literature focus on determining suitable initial centers. However, selecting suitable initial centers is not always possible, especially as the number of clusters increases. This paper proposes an iterative approach to improve the quality of the solution produced by k-means. In each iteration, the approach removes one cluster (minus), divides another one (plus), and applies re-clustering. The method is called iterative k-means minus–plus (I-k-means−+). I-k-means−+ is sped up by methods that determine which cluster should be removed, which one should be divided, and how to accelerate the re-clustering process. Experimental results show that I-k-means−+ can outperform k-means++, known as one of the most accurate versions of k-means, in terms of minimizing SSEDM. For some instances, the accuracy of I-k-means−+ is about 2 times higher than that of both k-means and k-means++, while it is faster than k-means++ and has a reasonable runtime in comparison with k-means.
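The minus-plus iteration described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not specify how the clusters are chosen, so the selection rules here are assumptions (remove the cluster whose points are cheapest to reassign to their second-nearest center, divide the cluster with the largest SSEDM contribution), and re-clustering uses plain Lloyd iterations with none of the paper's acceleration. The function names `ssedm`, `lloyd`, `minus_plus_step`, and `iterative_minus_plus` are hypothetical.

```python
import numpy as np

def sq_dists(X, centers):
    """Squared Euclidean distance from every point to every center, shape (n, k)."""
    return ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)

def ssedm(X, centers, labels):
    """Sum of squared Euclidean distances from each point to its assigned center."""
    return float(((X - centers[labels]) ** 2).sum())

def lloyd(X, centers, n_iter=100):
    """Plain k-means (Lloyd) iterations starting from the given centers."""
    centers = centers.copy()
    for _ in range(n_iter):
        labels = sq_dists(X, centers).argmin(axis=1)
        new = centers.copy()
        for j in range(len(centers)):
            if np.any(labels == j):          # an empty cluster keeps its old center
                new[j] = X[labels == j].mean(axis=0)
        if np.allclose(new, centers):
            break
        centers = new
    return centers, sq_dists(X, centers).argmin(axis=1)

def minus_plus_step(X, centers, labels, rng):
    """One minus-plus move; the result is kept only if SSEDM decreases."""
    k = len(centers)
    if k < 2:
        return centers, labels, False
    part = np.partition(sq_dists(X, centers), 1, axis=1)
    nearest, second = part[:, 0], part[:, 1]
    # Assumed "plus" rule: divide the cluster with the largest SSEDM contribution.
    split = int(np.bincount(labels, weights=nearest, minlength=k).argmax())
    # Assumed "minus" rule: remove the cluster that is cheapest to reassign.
    cost = np.bincount(labels, weights=second - nearest, minlength=k)
    cost[split] = np.inf
    remove = int(cost.argmin())
    members = X[labels == split]
    if len(members) < 2:
        return centers, labels, False
    trial = centers.copy()
    # Move the removed center onto a random member of the cluster to divide.
    trial[remove] = members[rng.integers(len(members))]
    trial, trial_labels = lloyd(X, trial)    # re-clustering
    if ssedm(X, trial, trial_labels) < ssedm(X, centers, labels):
        return trial, trial_labels, True
    return centers, labels, False

def iterative_minus_plus(X, k, max_rounds=10, seed=0):
    """Run k-means, then repeat minus-plus moves until one fails to improve SSEDM."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    centers, labels = lloyd(X, centers)
    for _ in range(max_rounds):
        centers, labels, improved = minus_plus_step(X, centers, labels, rng)
        if not improved:
            break
    return centers, labels
```

Because a move is accepted only when it lowers SSEDM, the sketch never returns a solution worse than the k-means run it starts from. Stopping at the first rejected move is a simplification chosen for brevity.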
