A hybrid feature selection method for DNA microarray data

Gene expression profiles, which represent the state of a cell at a molecular level, have great potential as a medical diagnosis tool. In cancer classification, available training data sets are generally of a fairly small sample size compared to the number of genes involved. Along with training data limitations, this constitutes a challenge to certain classification methods. Feature (gene) selection can be used to successfully extract those genes that directly influence classification accuracy and to eliminate genes which have no influence on it. This significantly improves calculation performance and classification accuracy. In this paper, correlation-based feature selection (CFS) and the Taguchi-genetic algorithm (TGA) method were combined into a hybrid method, and the K-nearest neighbor (KNN) with the leave-one-out cross-validation (LOOCV) method served as a classifier for eleven classification profiles to calculate the classification accuracy. Experimental results show that the proposed method reduced redundant features effectively and achieved superior classification accuracy. The classification accuracy obtained by the proposed method was higher in ten out of the eleven gene expression data set test problems when compared to other classification methods from the literature.

[1]  Li-Yeh Chuang,et al.  Tabu Search and Binary Particle Swarm Optimization for Feature Selection Using Microarray Data , 2009, J. Comput. Biol..

[2]  Donald E. Grierson,et al.  Comparison among five evolutionary-based optimization algorithms , 2005, Adv. Eng. Informatics.

[3]  Catherine Blake,et al.  UCI Repository of machine learning databases , 1998 .

[4]  Peter E. Hart,et al.  Nearest neighbor pattern classification , 1967, IEEE Trans. Inf. Theory.

[5]  Igor V. Tetko,et al.  Gene selection from microarray data for cancer classification - a machine learning approach , 2005, Comput. Biol. Chem..

[6]  Gavin C. Cawley,et al.  Efficient leave-one-out cross-validation of kernel fisher discriminant classifiers , 2003, Pattern Recognit..

[7]  Dorothea Heiss-Czedik,et al.  An Introduction to Genetic Algorithms. , 1997, Artificial Life.

[8]  Anil K. Jain,et al.  Dimensionality reduction using genetic algorithms , 2000, IEEE Trans. Evol. Comput..

[9]  J. Ross Quinlan,et al.  Induction of Decision Trees , 1986, Machine Learning.

[10]  Josef Kittler,et al.  Floating search methods in feature selection , 1994, Pattern Recognit. Lett..

[11]  Xiaoxing Liu,et al.  An Entropy-based gene selection method for cancer classification using microarray data , 2005, BMC Bioinformatics.

[12]  Byung Ro Moon,et al.  Hybrid Genetic Algorithms for Feature Selection , 2004, IEEE Trans. Pattern Anal. Mach. Intell..

[13]  Sung-Bae Cho,et al.  An Evolutionary Algorithm Approach to Optimal Ensemble Classifiers for DNA Microarray Data Analysis , 2008, IEEE Transactions on Evolutionary Computation.

[14]  Tung-Kuan Liu,et al.  Hybrid Taguchi-genetic algorithm for global numerical optimization , 2004, IEEE Transactions on Evolutionary Computation.

[15]  Chong-Ho Choi,et al.  Input feature selection for classification problems , 2002, IEEE Trans. Neural Networks.

[16]  Huan Liu,et al.  Toward integrating feature selection algorithms for classification and clustering , 2005, IEEE Transactions on Knowledge and Data Engineering.

[17]  Hong Fang,et al.  Decision forest for classification of gene expression data , 2010, Comput. Biol. Medicine.

[18]  Goldberg,et al.  Genetic algorithms , 1993, Robust Control Systems with Genetic Algorithms.

[19]  Hongbin Zhang,et al.  Feature selection using tabu search method , 2002, Pattern Recognit..

[20]  Jie Gui,et al.  Tumor classification by combining PNN classifier ensemble with neighborhood rough set based gene reduction , 2010, Comput. Biol. Medicine.

[21]  Zbigniew Michalewicz,et al.  Parameter Control in Evolutionary Algorithms , 2007, Parameter Setting in Evolutionary Algorithms.

[22]  Yanqing Zhang,et al.  Recursive Fuzzy Granulation for Gene Subsets Extraction and Cancer Classification , 2008, IEEE Transactions on Information Technology in Biomedicine.

[23]  Keinosuke Fukunaga,et al.  A Branch and Bound Algorithm for Feature Subset Selection , 1977, IEEE Transactions on Computers.

[24]  Geoff Holmes,et al.  Benchmarking Attribute Selection Techniques for Discrete Class Data Mining , 2003, IEEE Trans. Knowl. Data Eng..

[25]  Pedro Larrañaga,et al.  Filter versus wrapper gene selection approaches in DNA microarray domains , 2004, Artif. Intell. Medicine.

[26]  M. Xiong,et al.  Biomarker Identification by Feature Wrappers , 2022 .

[27]  Yuin Wu,et al.  Taguchi Methods for Robust Design , 2000 .

[28]  Ian H. Witten,et al.  Data mining in bioinformatics using Weka , 2004, Bioinform..

[29]  Ron Kohavi,et al.  Wrappers for Feature Subset Selection , 1997, Artif. Intell..

[30]  John H. Holland,et al.  Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence , 1992 .

[31]  S. Y. Sohn,et al.  Experimental study for the comparison of classifier combination methods , 2007, Pattern Recognit..

[32]  Zili Zhang,et al.  A multi-filter enhanced genetic ensemble system for gene selection and sample classification of microarray data , 2010, BMC Bioinformatics.

[33]  K. Deb,et al.  Reliable classification of two-class cancer data using evolutionary algorithms. , 2003, Bio Systems.

[34]  Qi Shen,et al.  Simultaneous genes and training samples selection by modified particle swarm optimization for gene expression data classification , 2009, Comput. Biol. Medicine.

[35]  Roberto Battiti,et al.  Using mutual information for selecting features in supervised neural net learning , 1994, IEEE Trans. Neural Networks.

[36]  Constantin F. Aliferis,et al.  A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis , 2004, Bioinform..

[37]  Lei Zhang,et al.  Gene expression data classification using locally linear discriminant embedding , 2010, Comput. Biol. Medicine.

[38]  David E. Goldberg,et al.  Genetic Algorithms in Search Optimization and Machine Learning , 1988 .

[39]  L-Y Chuang,et al.  Correlation-based Gene Selection and Classification Using Taguchi-BPSO , 2010, Methods of Information in Medicine.

[40]  Zbigniew Michalewicz,et al.  GAVaPS-a genetic algorithm with varying population size , 1994, Proceedings of the First IEEE Conference on Evolutionary Computation. IEEE World Congress on Computational Intelligence.

[41]  Pedro Larrañaga,et al.  A review of feature selection techniques in bioinformatics , 2007, Bioinform..

[42]  J. L. Hodges,et al.  Discriminatory Analysis - Nonparametric Discrimination: Consistency Properties , 1989 .

[43]  Li-Yeh Chuang,et al.  Improved binary PSO for feature selection using gene expression data , 2008, Comput. Biol. Chem..

[44]  A. Ghosh On optimum choice of k in nearest neighbor classification , 2006 .

[45]  Weixin Xie,et al.  A Novel Hybrid Feature Selection Method Based on IFSFFS and SVM for the Diagnosis of Erythemato-Squamous Diseases , 2010, WAPA.

[46]  Wen-Chin Chen,et al.  A neural network-based approach for dynamic quality prediction in a plastic injection molding process , 2008, Expert Syst. Appl..

[47]  D. E. Goldberg,et al.  Genetic Algorithms in Search , 1989 .

[48]  M. Stone Cross‐Validatory Choice and Assessment of Statistical Predictions , 1976 .