Automatic clustering by multi-objective genetic algorithm with numeric and categorical features

Abstract Many clustering algorithms, collectively known as K-clustering algorithms, require the user to specify the number of clusters (K) in advance. Without domain knowledge, an accurate value of K is difficult to predict. The problem becomes critical when the dimensionality of the data points is large; when clusters differ widely in shape, size, and density; and when clusters overlap. Determining a suitable K is an optimization problem, and automatic clustering algorithms can discover the optimal K. This paper presents an automatic clustering algorithm that improves on K-clustering algorithms because it discovers an optimal value of K itself. Iterative hill-climbing algorithms such as K-Means work on a single solution and converge to a local optimum. In contrast, Genetic Algorithms (GAs) search for near-global optimum solutions, i.e. the optimal K as well as the optimal cluster centroids. Single-objective clustering algorithms are adequate for grouping linearly separable clusters but perform poorly on non-linearly separable ones. To group non-linearly separable clusters, we apply a Multi-Objective Genetic Algorithm (MOGA) that minimizes the intra-cluster distance and maximizes the inter-cluster distance. Many existing MOGA-based clustering algorithms are suitable for either numeric or categorical features, but not both. This paper pioneers the use of MOGA for automatic clustering with mixed types of features. Statistical testing of experimental results on real-life benchmark data sets from the University of California at Irvine (UCI) machine learning repository demonstrates the superiority of the proposed algorithm.
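To make the two objectives concrete, the following is a minimal sketch of how intra-cluster and inter-cluster distance might be evaluated for a candidate partition of mixed-type data. It is not the paper's implementation: the abstract does not state which distance measure or cluster representative is used, so the Euclidean-plus-mismatch distance, the mean/mode prototype, and all function names (`mixed_distance`, `prototype`, `objectives`, `dominates`) are assumptions chosen for illustration.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def mixed_distance(a, b, numeric_idx, categorical_idx):
    """Assumed measure: Euclidean distance on numeric features
    plus the number of mismatches on categorical features."""
    numeric_part = sqrt(sum((a[i] - b[i]) ** 2 for i in numeric_idx))
    categorical_part = sum(1 for i in categorical_idx if a[i] != b[i])
    return numeric_part + categorical_part


def prototype(points, numeric_idx, categorical_idx):
    """Assumed cluster representative: mean of each numeric feature,
    most frequent value (mode) of each categorical feature."""
    proto = list(points[0])
    for i in numeric_idx:
        proto[i] = sum(p[i] for p in points) / len(points)
    for i in categorical_idx:
        proto[i] = Counter(p[i] for p in points).most_common(1)[0][0]
    return tuple(proto)


def objectives(data, labels, numeric_idx, categorical_idx):
    """Return (intra, inter) for one candidate partition.
    intra: summed distance of every point to its cluster prototype (minimize).
    inter: summed pairwise distance between prototypes (maximize)."""
    clusters = {}
    for point, label in zip(data, labels):
        clusters.setdefault(label, []).append(point)
    protos = {
        k: prototype(pts, numeric_idx, categorical_idx)
        for k, pts in clusters.items()
    }
    intra = sum(
        mixed_distance(p, protos[k], numeric_idx, categorical_idx)
        for k, pts in clusters.items()
        for p in pts
    )
    inter = sum(
        mixed_distance(protos[a], protos[b], numeric_idx, categorical_idx)
        for a, b in combinations(sorted(protos), 2)
    )
    return intra, inter


def dominates(u, v):
    """Pareto dominance for (intra, inter) pairs: u is no worse than v
    on both objectives and strictly better on at least one."""
    return u[0] <= v[0] and u[1] >= v[1] and (u[0] < v[0] or u[1] > v[1])


# Toy data: feature 0 is numeric, feature 1 is categorical.
data = [(1.0, "red"), (1.2, "red"), (5.0, "blue"), (5.4, "blue")]
compact = objectives(data, [0, 0, 1, 1], numeric_idx=[0], categorical_idx=[1])
mixed_up = objectives(data, [0, 1, 0, 1], numeric_idx=[0], categorical_idx=[1])
print("compact partition (intra, inter):", compact)
print("mixed-up partition (intra, inter):", mixed_up)
print("compact dominates mixed-up:", dominates(compact, mixed_up))
```

In a multi-objective GA, each chromosome would encode a candidate partition (and hence a candidate K), and a dominance test of this kind would decide which candidates stay on the non-dominated front. The encoding, selection scheme, and genetic operators are not described in the abstract and are deliberately left out of this sketch.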
