Variable Weighting in Fuzzy k-Means Clustering to Determine the Number of Clusters

One of the most significant problems in cluster analysis is determining the number of clusters in unlabeled data, which is a required input for most clustering algorithms. Several methods have been developed to address this problem. However, little attention has been paid to algorithms that are insensitive to the initialization of cluster centers and that use variable weights to recover the number of clusters. To fill this gap, we extend the standard fuzzy $k$-means clustering algorithm so that it automatically determines the number of clusters by iteratively computing the weights of all variables and the membership value of each object in every cluster. Two new steps are added to the fuzzy $k$-means clustering process. The first introduces a penalty term that makes the clustering process insensitive to the initial cluster centers. The second applies a formula that iteratively updates the variable weights in each cluster based on the current partition of the data. Experimental results on real-world and synthetic datasets show that the proposed algorithm determines the correct number of clusters even when initialized with different numbers of cluster centroids. We also applied the proposed algorithm to gene data to identify a subset of important genes.
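To make the iterative structure concrete, the following is a minimal sketch of a variable-weighted fuzzy $k$-means loop in Python. It alternates the standard fuzzy membership and centroid updates with an entropy-regularized per-cluster weight update; the `gamma` regularization parameter, the exact form of the weight update, and the omission of the paper's initialization-insensitivity penalty term are all illustrative assumptions, not the authors' exact formulation.

```python
import numpy as np

def weighted_fuzzy_kmeans(X, k, m=2.0, gamma=1.0, n_iter=100, tol=1e-6, seed=0):
    """Illustrative sketch: fuzzy k-means with per-cluster variable weights.

    X     : (n, d) data matrix
    k     : number of initial cluster centroids
    m     : fuzzifier (> 1)
    gamma : assumed entropy-regularization parameter for the weight update
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    centers = X[rng.choice(n, size=k, replace=False)]   # random initial centroids
    weights = np.full((k, d), 1.0 / d)                   # uniform variable weights per cluster

    for _ in range(n_iter):
        # Weighted squared distance of every object to every centroid
        diff = X[:, None, :] - centers[None, :, :]                       # (n, k, d)
        dist = np.einsum('nkd,kd->nk', diff ** 2, weights) + 1e-12       # (n, k)

        # Fuzzy membership update: u_ij = 1 / sum_c (d_ij / d_ic)^(1/(m-1))
        U = 1.0 / np.sum((dist[:, :, None] / dist[:, None, :]) ** (1.0 / (m - 1)), axis=2)

        # Centroid update weighted by fuzzified memberships
        Um = U ** m                                                      # (n, k)
        new_centers = (Um.T @ X) / Um.sum(axis=0)[:, None]

        # Per-cluster variable weight update (entropy-regularized, assumed form):
        # variables with smaller within-cluster dispersion receive larger weights
        D = np.einsum('nk,nkd->kd', Um, (X[:, None, :] - new_centers[None, :, :]) ** 2)
        W = np.exp(-(D - D.min(axis=1, keepdims=True)) / gamma)
        weights = W / W.sum(axis=1, keepdims=True)

        if np.linalg.norm(new_centers - centers) < tol:
            centers = new_centers
            break
        centers = new_centers

    return centers, U, weights

# Usage on a small synthetic dataset with two well-separated groups
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 0.3, (50, 4)), rng.normal(3, 0.3, (50, 4))])
    centers, U, weights = weighted_fuzzy_kmeans(X, k=2)
    print("centroids:\n", centers)
    print("variable weights per cluster:\n", weights)
```

In the paper's algorithm, clusters whose memberships collapse under the penalty term can be discarded, which is how the number of clusters is recovered; the sketch above only illustrates the alternating membership, centroid, and variable-weight updates.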
