Using Feature Clustering for GP-Based Feature Construction on High-Dimensional Data

Feature construction is a pre-processing technique that creates new features with better discriminating ability from the original features. Genetic programming (GP) has been shown to be a prominent technique for this task. However, applying GP to high-dimensional data remains challenging because of the large search space. Feature clustering groups similar features into clusters, which can be used for dimensionality reduction by choosing representative features from each cluster to form a feature subset. Feature clustering has shown promise in feature selection, but it has not been investigated in feature construction for classification. This paper presents the first work to utilise feature clustering in this area. We propose a cluster-based GP feature construction method called CGPFC, which uses feature clustering to improve the performance of GP for feature construction on high-dimensional data. Results on eight high-dimensional datasets of varying difficulty show that the features constructed by CGPFC perform better than both the original full feature set and the features constructed by a standard GP constructor that uses the whole feature set.
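The general idea of feature clustering followed by representative selection can be illustrated with a minimal sketch. This is not the CGPFC algorithm: the abstract does not specify the similarity measure or clustering procedure, so the greedy grouping by absolute Pearson correlation, the `threshold` value, the label-correlation criterion for choosing representatives, and the function names `cluster_features` and `pick_representatives` are all assumptions made for illustration.

```python
import numpy as np

def cluster_features(X, threshold=0.9):
    """Greedily group features whose absolute Pearson correlation with a
    seed feature is at least `threshold` (illustrative assumption only)."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    unassigned = list(range(X.shape[1]))
    clusters = []
    while unassigned:
        seed = unassigned.pop(0)
        # The seed plus every still-unassigned feature similar to it.
        members = [seed] + [j for j in unassigned if corr[seed, j] >= threshold]
        unassigned = [j for j in unassigned if j not in members]
        clusters.append(members)
    return clusters

def pick_representatives(X, y, clusters):
    """Keep, from each cluster, the feature most correlated with the label."""
    reps = []
    for members in clusters:
        scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in members]
        reps.append(members[int(np.argmax(scores))])
    return reps

# Synthetic data: three independent signals, each with a near-duplicate.
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 3))
X = np.column_stack([
    base[:, 0], base[:, 0] + 0.01 * rng.normal(size=100),
    base[:, 1], base[:, 1] + 0.01 * rng.normal(size=100),
    base[:, 2], base[:, 2] + 0.01 * rng.normal(size=100),
])
y = (base[:, 0] + base[:, 1] > 0).astype(int)

clusters = cluster_features(X)
reps = pick_representatives(X, y, clusters)
print("clusters:", clusters)
print("representatives:", reps)
```

In a cluster-based construction pipeline, a reduced subset such as `reps` could then serve as the terminal set of a GP feature constructor in place of all original features, which is one way the search space might be narrowed.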
