A New Evolutionary Algorithm for Determining the Optimal Number of Clusters

Estimating the optimal number of clusters for a dataset is one of the most essential issues in cluster analysis. An improper pre-selection of the number of clusters can easily lead to a poor clustering outcome. In this paper, we propose a new evolutionary algorithm to address this issue. Specifically, the proposed evolutionary algorithm defines a new entropy-based fitness function and three new genetic operators for splitting, merging, and removing clusters. Empirical evaluations using a synthetic dataset and an existing benchmark show that the proposed evolutionary algorithm can exactly estimate the optimal number of clusters for a given dataset.
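To make the three operators concrete, the following is a minimal sketch and not the algorithm defined in the paper. It assumes an individual is a list of centroid tuples, and every name in it (`split`, `merge`, `remove`, `assignment_entropy`, the `scale` parameter) is hypothetical. The entropy function is only a stand-in showing where an entropy-based fitness could plug in; the abstract does not give the actual fitness definition.

```python
import math
import random

def split(centroids, idx, rng, scale=0.5):
    """Replace centroid idx with two copies nudged in opposite directions."""
    c = centroids[idx]
    offset = [rng.uniform(-scale, scale) for _ in c]
    a = tuple(x + d for x, d in zip(c, offset))
    b = tuple(x - d for x, d in zip(c, offset))
    return centroids[:idx] + [a, b] + centroids[idx + 1:]

def merge(centroids, i, j):
    """Replace centroids i and j with their midpoint."""
    mid = tuple((x + y) / 2 for x, y in zip(centroids[i], centroids[j]))
    rest = [c for k, c in enumerate(centroids) if k not in (i, j)]
    return rest + [mid]

def remove(centroids, idx):
    """Drop centroid idx, always keeping at least one cluster."""
    if len(centroids) <= 1:
        return list(centroids)
    return centroids[:idx] + centroids[idx + 1:]

def assignment_entropy(points, centroids):
    """Placeholder score (an assumption, not the paper's fitness):
    Shannon entropy in nats of cluster sizes under nearest-centroid assignment."""
    counts = [0] * len(centroids)
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda k: math.dist(p, centroids[k]))
        counts[nearest] += 1
    n = len(points)
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)
```

Because each operator changes the number of centroids by exactly one, a population evolved with them can vary the cluster count freely instead of fixing it in advance. For instance, `merge([(0.0, 0.0), (10.0, 10.0)], 0, 1)` yields the single centroid `(5.0, 5.0)`.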
