ModEx and Seed-Detective: Two novel techniques for high quality clustering by using good initial seeds in K-Means

In this paper we present two clustering techniques, ModEx and Seed-Detective. ModEx is a modified version of an existing clustering technique called Ex-Detective and addresses some of its limitations. Seed-Detective combines ModEx with Simple K-Means: it uses ModEx to produce a set of high-quality initial seeds, which are then given as input to K-Means to produce the final clusters. High-quality initial seeds are expected to lead K-Means to high-quality clusters. We compare the performance of Seed-Detective and ModEx with that of Ex-Detective, PAM, Simple K-Means (SK), Basic Farthest Point Heuristic (BFPH) and New Farthest Point Heuristic (NFPH). We use three cluster evaluation criteria, namely F-measure, Entropy and Purity, and four natural datasets obtained from the UCI Machine Learning Repository. On these datasets our proposed techniques perform better than the existing techniques in terms of F-measure, Entropy and Purity. Sign test results suggest that the superiority of Seed-Detective (and ModEx) over the existing techniques is statistically significant.
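The seeding idea can be illustrated with a minimal sketch: K-Means started from caller-supplied seeds instead of random ones, followed by two of the evaluation criteria. This is not the authors' implementation. The function names, the toy data and the hand-picked seeds (standing in for ModEx output) are assumptions, and the purity and entropy formulas are the commonly used definitions, which may differ in detail from those in the paper.

```python
import math
from collections import Counter

def kmeans_from_seeds(points, seeds, max_iter=100):
    """Lloyd-style K-Means started from given seeds (hypothetical helper)."""
    centers = [tuple(s) for s in seeds]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments are stable, so stop
            break
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centers

def purity(labels, classes):
    """Fraction of records that belong to the majority class of their cluster."""
    total = 0
    for j in set(labels):
        counts = Counter(c for l, c in zip(labels, classes) if l == j)
        total += max(counts.values())
    return total / len(labels)

def entropy(labels, classes):
    """Size-weighted average of the class entropy inside each cluster."""
    n = len(labels)
    result = 0.0
    for j in set(labels):
        counts = Counter(c for l, c in zip(labels, classes) if l == j)
        size = sum(counts.values())
        h = -sum((v / size) * math.log2(v / size) for v in counts.values())
        result += (size / n) * h
    return result

# Toy data (assumed): two well-separated groups with known class labels.
points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
classes = ["a", "a", "a", "b", "b", "b"]
seeds = [(0, 0), (9, 9)]  # hand-picked stand-ins for ModEx-generated seeds

labels, centers = kmeans_from_seeds(points, seeds)
print(labels, purity(labels, classes), entropy(labels, classes))
```

Higher purity and lower entropy indicate clusters that agree better with the known class labels. Because the seeds are fixed, repeated runs give the same clustering, unlike randomly initialised K-Means.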
