DenClust: A Density Based Seed Selection Approach for K-Means

In this paper we present a clustering technique called DenClust that produces high quality initial seeds through a deterministic process, without requiring user input on the number of clusters k or the radius of the clusters r. The high quality seeds are given as input to K-Means, which uses them as its set of initial seeds to produce the final clusters. DenClust uses a density based approach for initial seed selection. It calculates the density of each record, where the density of a record is the number of records that have the minimum distance with that record. This approach is expected to produce high quality initial seeds for K-Means, resulting in high quality clusters from a dataset. The performance of DenClust is compared with five existing techniques, namely CRUDAW, AGCUK, Simple K-Means (SK), Basic Farthest Point Heuristic (BFPH) and New Farthest Point Heuristic (NFPH), in terms of three external cluster evaluation criteria, namely F-Measure, Entropy and Purity, and two internal cluster evaluation criteria, namely the Xie-Beni Index (XB) and Sum of Square Error (SSE). We use three natural datasets obtained from the UCI machine learning repository. DenClust performs better than all five existing techniques in terms of all five evaluation criteria on all three datasets used in this study.
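The abstract does not spell out the seed selection procedure, so the following is only a minimal sketch of the general idea under stated assumptions, not the DenClust algorithm itself. It assumes purely numerical records and Euclidean distance, reads "density" as the number of records whose nearest neighbour is the record in question, and picks seeds greedily in order of decreasing density. The function names, the `min_density` threshold and the rule for discarding the neighbours of a chosen seed are all hypothetical choices made for illustration.

```python
import math

def nearest_neighbor_density(records):
    """Assumed density: how many other records have this record as their nearest neighbour."""
    n = len(records)
    density = [0] * n
    nearest = [None] * n
    for i in range(n):
        best_j, best_d = None, math.inf
        for j in range(n):
            if i == j:
                continue
            d = math.dist(records[i], records[j])
            if d < best_d:  # strict comparison: ties go to the lowest index
                best_j, best_d = j, d
        nearest[i] = best_j
        if best_j is not None:
            density[best_j] += 1
    return density, nearest

def select_seeds(records, min_density=2):
    """Greedy, deterministic seed selection (hypothetical rule, not from the paper)."""
    density, nearest = nearest_neighbor_density(records)
    available = set(range(len(records)))
    seeds = []
    # Visit records from densest to sparsest; break ties by index for determinism.
    for i in sorted(range(len(records)), key=lambda i: (-density[i], i)):
        if i in available and density[i] >= min_density:
            seeds.append(records[i])
            # Drop the seed and every record that points to it as nearest neighbour.
            available.discard(i)
            for j in range(len(records)):
                if nearest[j] == i:
                    available.discard(j)
    return seeds

def kmeans(records, seeds, max_iter=100):
    """Plain Lloyd-style K-Means started from the given seeds."""
    centers = [tuple(s) for s in seeds]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centers)), key=lambda c: math.dist(r, centers[c]))
            for r in records
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(len(centers)):
            members = [r for r, l in zip(records, labels) if l == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers

# Toy data: two well separated groups, each with one central record.
records = [(0, 0), (0, 1), (1, 0), (0, -1), (-1, 0),
           (10, 10), (10, 11), (11, 10), (10, 9), (9, 10)]
seeds = select_seeds(records)
labels, centers = kmeans(records, seeds)
print(seeds, labels, centers)
```

In this sketch the number of seeds, and therefore k, falls out of the density pass rather than being supplied by the user, which mirrors the motivation stated in the abstract. The `min_density` threshold is still a parameter here, which is a simplification of the sketch and not a claim about DenClust.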
