CRUDAW: A Novel Fuzzy Technique for Clustering Records Following User Defined Attribute Weights

We present a novel fuzzy clustering technique called CRUDAW that allows a data miner to assign weights to the attributes of a data set according to their importance (to the data miner) for clustering. The technique selects initial seeds deterministically, rather than randomly, using the density of the records of the data set. CRUDAW also selects the initial fuzzy membership degrees deterministically. Moreover, it uses a novel distance measure that takes the user-defined attribute weights into account. When measuring the distance between two values of a categorical attribute, the technique considers the similarity of the values instead of treating the distance as either 0 or 1. The complete algorithm for CRUDAW is presented in the paper. We experimentally compare our technique with three existing techniques, namely SABC, GFCM, and KL-FCM-GM, using four evaluation criteria: silhouette coefficient, F-measure, purity, and entropy. We also use a t-test, a confidence interval test, and time complexity to evaluate the performance of our technique. Four data sets from the UCI Machine Learning Repository are used in the experiments. Our experimental results indicate that CRUDAW performs significantly better than the existing techniques in producing high-quality clusters.
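
To illustrate the idea of a weighted distance with graded categorical dissimilarity, the following is a minimal sketch only; it is not the CRUDAW distance measure itself, whose exact definition is given in the paper. The function name `weighted_distance`, the range-based normalisation of numerical attributes, and the user-supplied `similarity` table are all assumptions made for this example.

```python
from typing import Dict, Sequence, Tuple, Union

Value = Union[float, str]


def weighted_distance(
    a: Sequence[Value],
    b: Sequence[Value],
    weights: Sequence[float],
    numeric_ranges: Dict[int, float],
    similarity: Dict[Tuple[int, str, str], float],
) -> float:
    """Illustrative weighted distance between two mixed-type records.

    Assumptions (not taken from the paper):
    - numeric_ranges maps the index of each numerical attribute to its
      (positive) range in the data set; every other attribute is
      treated as categorical.
    - similarity maps (attribute index, value, value) to a similarity
      in [0, 1]; missing pairs are treated as having similarity 0.
    """
    weight_sum = sum(weights)
    if weight_sum <= 0:
        raise ValueError("at least one attribute weight must be positive")

    total = 0.0
    for j, (x, y) in enumerate(zip(a, b)):
        if j in numeric_ranges:
            # Numerical attribute: absolute difference scaled to [0, 1].
            d = abs(x - y) / numeric_ranges[j]
        elif x == y:
            # Identical categorical values are at distance 0.
            d = 0.0
        else:
            # Different categorical values: distance is 1 - similarity,
            # so similar values are closer than dissimilar ones.
            s = similarity.get((j, x, y), similarity.get((j, y, x), 0.0))
            d = 1.0 - s
        total += weights[j] * d

    # Normalise by the total weight so the result stays in [0, 1].
    return total / weight_sum


record_a = (30.0, "nurse")
record_b = (40.0, "doctor")
distance = weighted_distance(
    record_a,
    record_b,
    weights=[1.0, 3.0],                      # second attribute matters more
    numeric_ranges={0: 50.0},                # attribute 0 is numerical
    similarity={(1, "nurse", "doctor"): 0.8},
)
print(round(distance, 3))
```

In this sketch, raising the weight of an attribute increases its share of the overall distance, and two different categorical values with high similarity contribute less than a hard 0/1 mismatch would.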
