A new weighting k-means type clustering framework with an l2-norm regularization

Abstract: The k-means algorithm has proven to be an effective technique for clustering large-scale data sets. However, traditional k-means type clustering algorithms cannot effectively distinguish the discriminative capabilities of features in the clustering process. In this paper, we present a new k-means type clustering framework that extends W-k-means with an l2-norm regularization on the feature weights. Based on this framework, we propose the l2-Wkmeans algorithm, which uses conventional means as centroids for clustering numerical data sets, and the l2-NOF and l2-NDM algorithms, which use two different smooth mode representatives for clustering categorical data sets. First, a new objective function is developed for the clustering framework. Then, the corresponding updating rules for the centroids, the membership matrix, and the feature weights are derived theoretically for the new algorithms. We conduct extensive experiments to evaluate the performance of the proposed algorithms on numerical and categorical data sets. The experimental studies demonstrate that the proposed algorithms deliver consistently promising results compared with other approaches, such as basic k-means, W-k-means, MKM_NOF, and MKM_NDM, with respect to four metrics: Accuracy, RandIndex, Fscore, and Normalized Mutual Information (NMI).
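The abstract does not state the objective function or the updating rules, so the following is only a minimal sketch of how an l2-regularized feature-weighting k-means for numerical data could look, not the authors' l2-Wkmeans. It assumes the objective sum_l sum_i u_il sum_j w_j (x_ij - z_lj)^2 + gamma * sum_j w_j^2 with the weights constrained to be nonnegative and to sum to one. Under that assumption the weight step is a Euclidean projection onto the probability simplex, computed here with a standard sort-based method. The names `project_to_simplex` and `l2_weighted_kmeans` and the parameter `gamma` are hypothetical.

```python
import numpy as np


def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1} (sort-based)."""
    u = np.sort(v)[::-1]                      # entries in descending order
    css = np.cumsum(u) - 1.0                  # shifted cumulative sums
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]  # last index kept in the support
    theta = css[rho] / (rho + 1)              # shift that makes the result sum to 1
    return np.maximum(v - theta, 0.0)


def l2_weighted_kmeans(X, k, gamma=1.0, n_iter=50, seed=0):
    """Sketch of a feature-weighting k-means with an l2 penalty on the weights.

    Assumed objective (not taken from the paper):
        sum_l sum_i u_il sum_j w_j (x_ij - z_lj)^2 + gamma * sum_j w_j^2
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    Z = X[rng.choice(n, size=k, replace=False)].astype(float)  # initial centroids
    w = np.full(m, 1.0 / m)                                     # uniform weights
    labels = np.zeros(n, dtype=int)
    for _ in range(n_iter):
        # Membership step: assign each object to the nearest centroid
        # under the current weighted squared Euclidean distance.
        sq = (X[:, None, :] - Z[None, :, :]) ** 2   # shape (n, k, m)
        labels = np.argmin(sq @ w, axis=1)
        # Centroid step: conventional means; empty clusters keep their centroid.
        for l in range(k):
            members = X[labels == l]
            if len(members) > 0:
                Z[l] = members.mean(axis=0)
        # Weight step: minimizing sum_j w_j D_j + gamma * ||w||^2 on the simplex
        # is the projection of -D / (2 * gamma) onto the simplex.
        D = ((X - Z[labels]) ** 2).sum(axis=0)      # per-feature dispersion
        w = project_to_simplex(-D / (2.0 * gamma))
    return labels, Z, w
```

In this sketch a larger `gamma` pulls the weights toward the uniform vector, while a smaller `gamma` concentrates them on the features with the lowest within-cluster dispersion.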
