Clustering Nominal and Numerical Data: A New Distance Concept for a Hybrid Genetic Algorithm

As intrinsic structures, like the number of clusters, is, for real data, a major issue of the clustering problem, we propose, in this paper, CHyGA (Clustering Hybrid Genetic Algorithm) an hybrid genetic algorithm for clustering. CHyGA treats the clustering problem as an optimization problem and searches for an optimal number of clusters characterized by an optimal distribution of instances into the clusters. CHyGA introduces a new representation of solutions and uses dedicated operators, such as one iteration of K-means as a mutation operator. In order to deal with nominal data, we propose a new definition of the cluster center concept and demonstrate its properties. Experimental results on classical benchmarks are given.

[1]  James C. Bezdek,et al.  Genetic algorithm guided clustering , 1994, Proceedings of the First IEEE Conference on Evolutionary Computation. IEEE World Congress on Computational Intelligence.

[2]  Donald W. Bouldin,et al.  A Cluster Separation Measure , 1979, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[3]  Enrique H. Ruspini,et al.  Numerical methods for fuzzy clustering , 1970, Inf. Sci..

[4]  J. C. Dunn,et al.  A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters , 1973 .

[5]  T. Caliński,et al.  A dendrite method for cluster analysis , 1974 .

[6]  Zbigniew Michalewicz,et al.  Genetic Algorithms + Data Structures = Evolution Programs , 1996, Springer Berlin Heidelberg.

[7]  Joshua Zhexue Huang,et al.  Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values , 1998, Data Mining and Knowledge Discovery.

[8]  James C. Bezdek,et al.  Clustering with a genetically optimized approach , 1999, IEEE Trans. Evol. Comput..

[9]  Yoshua Bengio,et al.  Convergence Properties of the K-Means Algorithms , 1994, NIPS.

[10]  El-Ghazali Talbi,et al.  A data mining approach to discover genetic and environmental factors involved in multifactorial diseases , 2002, Knowl. Based Syst..

[11]  Jin-Kao Hao,et al.  Hybrid Evolutionary Algorithms for Graph Coloring , 1999, J. Comb. Optim..

[12]  Catherine Blake,et al.  UCI Repository of machine learning databases , 1998 .

[13]  Ali S. Hadi,et al.  Finding Groups in Data: An Introduction to Chster Analysis , 1991 .

[14]  C. J. van Rijsbergen,et al.  The use of hierarchic clustering in information retrieval , 1971, Inf. Storage Retr..

[15]  Ujjwal Maulik,et al.  Genetic clustering for automatic evolution of clusters and application to image classification , 2002, Pattern Recognit..

[16]  Emanuel Falkenauer,et al.  Genetic Algorithms and Grouping Problems , 1998 .

[17]  Anil K. Jain,et al.  Data clustering: a review , 1999, CSUR.

[18]  Donald R. Jones,et al.  Solving Partitioning Problems with Genetic Algorithms , 1991, International Conference on Genetic Algorithms.

[19]  Rowena Cole,et al.  Clustering with genetic algorithms , 1998 .

[20]  C. L. Liu,et al.  Introduction to Combinatorial Mathematics. , 1971 .