A k-means based co-clustering (kCC) algorithm for sparse, high dimensional data

Abstract: The k-means algorithm is a widely used method that starts with an initial partitioning of the data and then iteratively converges towards a local solution by reducing the Sum of Squared Errors (SSE). It is known to suffer from the cluster center initialization problem, and the iterative step simply (re-)labels the data points based on the initial partition. Most improvements to k-means proposed in the literature focus on the initialization step alone and make no attempt to guide the iterative convergence by exploiting statistical information from the data. Using higher-order statistics (such as paths from random walks in a graph) and the duality in the data (as in co-clustering), for instance, are known ways to improve clustering results. What is unique and significant in our proposed approach is that we embed these concepts into the k-means algorithm itself, rather than using them only as an external distance measure, and present a unified framework called the k-means based co-clustering (kCC) algorithm. The initialization step is modified so that multiple points represent each cluster center, chosen such that the points representing one cluster are close together but far from the points representing other clusters. Moreover, neighborhood walk statistics are proposed as a semantic similarity technique for both cluster assignment and center re-estimation in the iterative process. The effectiveness of the combined approach is evaluated on several standard data sets. Our results show that kCC performs better than both the baseline k-means and other state-of-the-art improvements.
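For orientation, the baseline procedure that the abstract refers to can be sketched as follows. This is a minimal illustration of standard k-means (Lloyd's iteration) with squared Euclidean distance, not the kCC algorithm: it does not include the multi-point initialization or the neighborhood walk similarity. The function names (`assign`, `update`, `sse`, `kmeans`), the random choice of initial centers, and the empty-cluster handling are assumptions made for the example.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def sse(points, centers, labels):
    """Sum of Squared Errors of each point to its assigned center."""
    return sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))


def assign(points, centers):
    """Label every point with the index of its nearest center."""
    labels = []
    for p in points:
        dists = [sq_dist(p, c) for c in centers]
        labels.append(dists.index(min(dists)))
    return labels


def update(points, labels, centers):
    """Re-estimate each center as the mean of its members.

    Assumption for this sketch: an empty cluster keeps its old center.
    """
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, l in zip(points, labels) if l == j]
        if not members:
            new_centers.append(old)
            continue
        dim = len(old)
        new_centers.append(
            tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
        )
    return new_centers


def kmeans(points, k, seed=0, max_iter=100):
    """Baseline k-means; returns centers, labels, and the SSE per iteration."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # naive initialization (assumption)
    labels = assign(points, centers)
    history = []
    for _ in range(max_iter):
        centers = update(points, labels, centers)
        new_labels = assign(points, centers)
        history.append(sse(points, centers, new_labels))
        if new_labels == labels:  # converged: the partition is stable
            break
        labels = new_labels
    return centers, labels, history
```

The recorded SSE never increases from one iteration to the next, which is why the outcome depends so strongly on the initial partition: the loop can only refine the starting point towards a nearby local solution.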
