A new multiobjective clustering technique based on the concepts of stability and symmetry

Most clustering algorithms operate by optimizing (either implicitly or explicitly) a single measure of cluster solution quality. Such methods may perform well on some data sets but lack robustness with respect to variations in cluster shape, proximity, evenness and so forth. In this paper, we have proposed a multiobjective clustering technique which optimizes simultaneously two objectives, one reflecting the total cluster symmetry and the other reflecting the stability of the obtained partitions over different bootstrap samples of the data set. The proposed algorithm uses a recently developed simulated annealing-based multiobjective optimization technique, named AMOSA, as the underlying optimization strategy. Here, points are assigned to different clusters based on a newly defined point symmetry-based distance rather than the Euclidean distance. Results on several artificial and real-life data sets in comparison with another multiobjective clustering technique, MOCK, three single objective genetic algorithm-based automatic clustering techniques, VGAPS clustering, GCUK clustering and HNGA clustering, and several hybrid methods of determining the appropriate number of clusters from data sets show that the proposed technique is well suited to detect automatically the appropriate number of clusters as well as the appropriate partitioning from data sets having point symmetric clusters. The performance of AMOSA as the underlying optimization technique in the proposed clustering algorithm is also compared with PESA-II, another evolutionary multiobjective optimization technique.

[1]  Gary B. Lamont,et al.  Multiobjective Evolutionary Algorithms: Analyzing the State-of-the-Art , 2000, Evolutionary Computation.

[2]  Sanghamitra Bandyopadhyay,et al.  Classification and learning using genetic algorithms - applications in bioinformatics and web intelligence , 2007, Natural computing series.

[3]  Martin J. Oates,et al.  PESA-II: region-based selection in evolutionary multiobjective optimization , 2001 .

[4]  Richi Nayak,et al.  Fast and effective clustering of XML data using structural information , 2008, Knowledge and Information Systems.

[5]  Ujjwal Maulik,et al.  Validity index for crisp and fuzzy clusters , 2004, Pattern Recognit..

[6]  Chien-Hsing Chou,et al.  Short Papers , 2001 .

[7]  Ira Assent,et al.  Clustering multidimensional sequences in spatial and temporal databases , 2007, Knowledge and Information Systems.

[8]  Emile H. L. Aarts,et al.  Simulated Annealing: Theory and Applications , 1987, Mathematics and Its Applications.

[9]  Donald Geman,et al.  Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images , 1984 .

[10]  Tao Li,et al.  Clustering based on matrix approximation: a unifying view , 2008, Knowledge and Information Systems.

[11]  R. Fisher THE USE OF MULTIPLE MEASUREMENTS IN TAXONOMIC PROBLEMS , 1936 .

[12]  Lalit M. Patnaik,et al.  Adaptive probabilities of crossover and mutation in genetic algorithms , 1994, IEEE Trans. Syst. Man Cybern..

[13]  Jiawei Han,et al.  SRDA: An Efficient Algorithm for Large-Scale Discriminant Analysis , 2008, IEEE Transactions on Knowledge and Data Engineering.

[14]  F. Attneave Symmetry, information, and memory for patterns. , 1955, The American journal of psychology.

[15]  Kalyanmoy Deb,et al.  A fast and elitist multiobjective genetic algorithm: NSGA-II , 2002, IEEE Trans. Evol. Comput..

[16]  R HruschkaEduardo,et al.  A genetic algorithm for cluster analysis , 2003 .

[17]  Joshua D. Knowles,et al.  An Evolutionary Approach to Multiobjective Clustering , 2007, IEEE Transactions on Evolutionary Computation.

[18]  Soon-H. Kwon Cluster validity index for fuzzy clustering , 1998 .

[19]  Wilfred Ng,et al.  A survey on algorithms for mining frequent itemsets over data streams , 2008, Knowledge and Information Systems.

[20]  Philip S. Yu,et al.  Top 10 algorithms in data mining , 2007, Knowledge and Information Systems.

[21]  Martin Ester,et al.  Robust projected clustering , 2007, Knowledge and Information Systems.

[22]  Weiguo Sheng,et al.  A weighted sum validity function for clustering with a hybrid niching genetic algorithm , 2005, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics).

[23]  Ujjwal Maulik,et al.  A Simulated Annealing-Based Multiobjective Optimization Algorithm: AMOSA , 2008, IEEE Transactions on Evolutionary Computation.

[24]  Sunil Arya,et al.  ANN: library for approximate nearest neighbor searching , 1998 .

[25]  Sanghamitra Bandyopadhyay,et al.  Application of a New Symmetry-Based Cluster Validity Index for Satellite Image Segmentation , 2008, IEEE Geoscience and Remote Sensing Letters.

[26]  Anil K. Jain,et al.  Statistical Pattern Recognition: A Review , 2000, IEEE Trans. Pattern Anal. Mach. Intell..

[27]  Anne M. Denton,et al.  Pattern-based time-series subsequence clustering using radial distribution functions , 2009, Knowledge and Information Systems.

[28]  Robert Tibshirani,et al.  Cluster Validation by Prediction Strength , 2005 .

[29]  Y. Fukuyama,et al.  A new method of choosing the number of clusters for the fuzzy c-mean method , 1989 .

[30]  Gerardo Beni,et al.  A Validity Measure for Fuzzy Clustering , 1991, IEEE Trans. Pattern Anal. Mach. Intell..

[31]  R. K. Ursem Multi-objective Optimization using Evolutionary Algorithms , 2009 .

[32]  Sanghamitra Bandyopadhyay,et al.  A Point Symmetry-Based Clustering Technique for Automatic Evolution of Clusters , 2008, IEEE Transactions on Knowledge and Data Engineering.

[33]  Masao Sakauchi,et al.  The BD-Tree - A New N-Dimensional Data Structure with Highly Efficient Dynamic Characteristics , 1983, IFIP Congress.

[34]  James C. Bezdek,et al.  Some new indexes of cluster validity , 1998, IEEE Trans. Syst. Man Cybern. Part B.

[35]  Nelson F. F. Ebecken,et al.  A genetic algorithm for cluster analysis , 2003, Intell. Data Anal..

[36]  Donald W. Bouldin,et al.  A Cluster Separation Measure , 1979, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[37]  Dong-Jo Park,et al.  A Novel Validity Index for Determination of the Optimal Number of Clusters , 2001 .

[38]  Mark de Berg,et al.  Computational geometry: algorithms and applications , 1997 .

[39]  Robert Tibshirani,et al.  Estimating the number of clusters in a data set via the gap statistic , 2000 .

[40]  Ujjwal Maulik,et al.  Performance Evaluation of Some Clustering Algorithms and Validity Indices , 2002, IEEE Trans. Pattern Anal. Mach. Intell..

[41]  Sanghamitra Bandyopadhyay,et al.  GAPS: A clustering method using a new point symmetry-based distance measure , 2007, Pattern Recognit..

[42]  Joachim M. Buhmann,et al.  Stability-Based Validation of Clustering Solutions , 2004, Neural Computation.

[43]  J. Dunn Well-Separated Clusters and Optimal Fuzzy Partitions , 1974 .

[44]  S. Bandyopadhyay,et al.  Nonparametric genetic clustering: comparison of validity indices , 2001, IEEE Trans. Syst. Man Cybern. Syst..

[45]  J. Breckenridge Replicating Cluster Analysis: Method, Consistency, and Validity. , 1989, Multivariate behavioral research.

[46]  I. Guyon,et al.  Detecting stable clusters using principal component analysis. , 2003, Methods in molecular biology.

[47]  Ujjwal Maulik,et al.  Genetic clustering for automatic evolution of clusters and application to image classification , 2002, Pattern Recognit..

[48]  John H. Holland,et al.  Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence , 1992 .

[49]  S. Dudoit,et al.  A prediction-based resampling method for estimating the number of clusters in a dataset , 2002, Genome Biology.