A Complementary Optimization Procedure for Final Cluster Analysis of Clustering Categorical Data

Cluster analysis has become an indispensable tool for finding and analyzing meaningful groups in both numerical and categorical data. Algorithms such as fuzzy k-Modes, New fuzzy k-Modes, k-AMH, and the extended k-AMH algorithms Nk-AMH I, II, and III are commonly used to improve the clustering of categorical data. However, the performance of these algorithms is evaluated by the average accuracy score over 100-run experiments, and computing accuracy requires labeled data. Their performance on unlabeled data therefore cannot be measured explicitly. This paper extends the k-AMH model with complementary optimization procedures, known as Ck-AMH I, II, III, and IV, to obtain a final, optimal clustering result. In experiments on five categorical datasets (Soybean, Zoo, Hepatitis, Voting, and Breast), the complementary procedures produced optimal clustering results. The optimal accuracy scores were marginally lower than the maximum accuracy scores from the 100-run experiments and, in some cases, identical to them. Consequently, with the complementary procedures, these clustering algorithms can be further developed into workbench clustering tools for both unlabeled categorical and unlabeled numerical data.
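To make the evaluation protocol concrete, the following is a minimal sketch of how accuracy scores from repeated runs might be summarized. It assumes a majority-class definition of clustering accuracy (each cluster is credited with its most frequent true label), which is common in the k-modes literature but is not stated in the abstract. The function names `clustering_accuracy` and `summarize_runs` are hypothetical, and the sketch does not implement the Ck-AMH procedures themselves.

```python
from collections import Counter
from statistics import mean


def clustering_accuracy(labels, assignments):
    """Fraction of objects that belong to the majority class of their cluster.

    Assumed definition (not taken from the abstract): sum over clusters of the
    count of the most frequent true label, divided by the number of objects.
    """
    by_cluster = {}
    for label, cluster in zip(labels, assignments):
        by_cluster.setdefault(cluster, Counter())[label] += 1
    matched = sum(counts.most_common(1)[0][1] for counts in by_cluster.values())
    return matched / len(labels)


def summarize_runs(labels, runs):
    """Summarize accuracy over several runs (e.g. 100 random restarts)."""
    scores = [clustering_accuracy(labels, run) for run in runs]
    return {"mean": mean(scores), "max": max(scores), "min": min(scores)}


# Toy data: six labeled objects and three hypothetical clustering runs.
labels = ["a", "a", "a", "b", "b", "b"]
runs = [
    [0, 0, 0, 1, 1, 1],  # perfect split
    [0, 0, 1, 1, 1, 1],  # one object misplaced
    [0, 1, 0, 1, 0, 1],  # poor split
]
summary = summarize_runs(labels, runs)
print(summary)
```

Every step of this summary depends on `labels`, which illustrates why an average or maximum accuracy cannot be reported for unlabeled data and why a label-free way of selecting a final clustering is needed.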
