A Niching Memetic Algorithm for Simultaneous Clustering and Feature Selection

Clustering is inherently difficult, and becomes more so when the relevant features must also be selected. In this paper we propose an approach for simultaneous clustering and feature selection using a niching memetic algorithm. Our approach (which we call NMA_CFS) makes feature selection an integral part of the global clustering search procedure and attempts to avoid being trapped in less promising locally optimal solutions in both clustering and feature selection, without making any a priori assumption about the number of clusters. Within the NMA_CFS procedure, a variable-length composite representation is devised to encode both the feature selection and the cluster centers, allowing different chromosomes to carry different numbers of clusters. Further, local search operations are introduced to refine the feature selection and cluster centers encoded in the chromosomes. Finally, a niching method is integrated to preserve population diversity and prevent premature convergence. In an experimental evaluation on both synthetic and real data, we demonstrate the effectiveness of the proposed approach and compare it with other related approaches.
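To make the composite representation more concrete, the following is a minimal sketch, not the NMA_CFS implementation itself. It assumes a chromosome made of a Boolean feature mask plus a variable-length list of cluster centers, and it uses a single K-means iteration restricted to the selected features as a stand-in for the local refinement of cluster centers. The names `Chromosome`, `random_chromosome`, `masked_sq_dist`, and `kmeans_step` are hypothetical, as are the initialization choices.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Chromosome:
    # Assumed encoding: one flag per feature, plus any number of centers.
    feature_mask: List[bool]
    centers: List[List[float]]  # each center has the full feature dimension


def random_chromosome(data, k_min, k_max, rng):
    """Draw a random feature subset and a random number of centers."""
    n_features = len(data[0])
    mask = [rng.random() < 0.5 for _ in range(n_features)]
    if not any(mask):
        # Guarantee that at least one feature is selected.
        mask[rng.randrange(n_features)] = True
    k = rng.randint(k_min, k_max)  # number of clusters is not fixed a priori
    centers = [list(p) for p in rng.sample(data, k)]
    return Chromosome(mask, centers)


def masked_sq_dist(a, b, mask):
    """Squared Euclidean distance computed over the selected features only."""
    return sum((x - y) ** 2 for x, y, m in zip(a, b, mask) if m)


def kmeans_step(ch, data):
    """One K-means iteration (assumed local search) refining the centers."""
    groups = [[] for _ in ch.centers]
    for p in data:
        nearest = min(
            range(len(ch.centers)),
            key=lambda i: masked_sq_dist(p, ch.centers[i], ch.feature_mask),
        )
        groups[nearest].append(p)
    new_centers = []
    for center, group in zip(ch.centers, groups):
        if group:
            new_centers.append([sum(col) / len(group) for col in zip(*group)])
        else:
            new_centers.append(list(center))  # empty cluster: keep old center
    return Chromosome(list(ch.feature_mask), new_centers)


if __name__ == "__main__":
    rng = random.Random(0)
    points = [[0.0, 5.0], [0.0, 7.0], [10.0, 5.0], [10.0, 7.0]]
    individual = random_chromosome(points, k_min=2, k_max=3, rng=rng)
    refined = kmeans_step(individual, points)
    print(len(refined.centers), refined.feature_mask)
```

Because each chromosome carries its own list of centers, individuals with different numbers of clusters can coexist in one population, which is what lets the search proceed without fixing the cluster count in advance. A niching step would then operate on top of this population; it is omitted here because the abstract does not specify which niching method is used.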
