MGRFE: Multilayer Recursive Feature Elimination Based on an Embedded Genetic Algorithm for Cancer Classification

Microarray gene expression data are widely used for cancer classification and for further research in bioinformatics. However, such data follow the "large p, small n" paradigm of limited biosamples and high-dimensional features, which makes gene selection a demanding task: the goal is to select a minimal number of discriminatory genes closely associated with a phenotype. Feature (gene) selection remains challenging because finding an optimal feature subset is NP-hard, so most existing feature selection algorithms rely on heuristic rules. Here we propose MGRFE, a multilayer recursive feature elimination method based on an embedded integer-coded genetic algorithm, which aims to select the gene combination with minimal size and maximal information. On 19 benchmark microarray datasets, including multiclass and imbalanced datasets, MGRFE outperforms state-of-the-art feature selection algorithms, achieving better cancer classification accuracy with fewer selected genes. MGRFE can therefore be regarded as a promising feature selection method for high-dimensional datasets, especially gene expression data. Moreover, the genes selected by MGRFE have close biological relevance to cancer phenotypes. The source code of the proposed algorithm and all 19 datasets used in this paper are available at https://github.com/Pengeace/MGRFE-GaRFE.
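The idea summarized above can be illustrated with a loose Python sketch. It is not the authors' implementation (that is in the repository linked above), and every name in it (`loo_accuracy`, `rank`, `ga_rfe`, `mgrfe_sketch`, `make_toy_data`) is hypothetical. The sketch assumes a nearest-centroid classifier scored by leave-one-out accuracy, a crossover that samples from the union of two parents, and a random one-gene elimination per generation. These choices are stand-ins for illustration only, and the abstract does not specify any of them.

```python
import random

def loo_accuracy(X, y, genes):
    """Leave-one-out accuracy of a nearest-centroid classifier restricted to `genes`."""
    hits = 0
    for i in range(len(X)):
        best_label, best_dist = None, float("inf")
        for label in sorted(set(y)):
            rows = [X[j] for j in range(len(X)) if j != i and y[j] == label]
            if not rows:
                continue
            # Squared distance from sample i to this class centroid on the chosen genes.
            d = sum((X[i][g] - sum(r[g] for r in rows) / len(rows)) ** 2 for g in genes)
            if d < best_dist:
                best_label, best_dist = label, d
        hits += best_label == y[i]
    return hits / len(X)

def rank(X, y, ind):
    """Sort key: higher accuracy first, then fewer genes."""
    return (-loo_accuracy(X, y, ind), len(ind))

def ga_rfe(X, y, candidates, start_size, rng, pop_size=12, generations=8):
    """One GA run; each individual is a list of gene indices (integer coding)
    and every child loses one gene, so subsets shrink generation by generation."""
    size = min(start_size, len(candidates))
    pop = [rng.sample(candidates, size) for _ in range(pop_size)]
    best, elite_genes = None, set()
    for gen in range(generations + 1):
        pop.sort(key=lambda ind: rank(X, y, ind))
        if best is None or rank(X, y, pop[0]) < rank(X, y, best):
            best = list(pop[0])
        elite_genes.update(pop[0])
        if gen == generations:
            break
        parents = pop[: pop_size // 2]           # truncation selection (assumed)
        children = []
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            pool = sorted(set(a) | set(b))
            child = rng.sample(pool, min(len(a), len(pool)))   # crossover
            unused = [g for g in candidates if g not in child]
            if unused and rng.random() < 0.2:                  # mutation
                child[rng.randrange(len(child))] = rng.choice(unused)
            if len(child) > 1:                                 # elimination step
                child.pop(rng.randrange(len(child)))
            children.append(child)
        pop = children
    return best, sorted(elite_genes)

def mgrfe_sketch(X, y, layers=3, start_size=6, seed=0):
    """Stack several GA runs; each layer searches only the genes that
    appeared in the previous layer's best individuals."""
    rng = random.Random(seed)
    candidates = list(range(len(X[0])))
    best = None
    for _ in range(layers):
        found, elite = ga_rfe(X, y, candidates, start_size, rng)
        if best is None or rank(X, y, found) < rank(X, y, best):
            best = found
        if len(elite) < 2:
            break
        candidates = elite
    return sorted(best), loo_accuracy(X, y, best)

def make_toy_data(seed=1, n_per_class=10, n_genes=30, informative=3):
    """Synthetic two-class data in which only one gene separates the classes."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.uniform(-1, 1) for _ in range(n_genes)]
            row[informative] = 5 * label + rng.uniform(-0.5, 0.5)
            X.append(row)
            y.append(label)
    return X, y

X, y = make_toy_data()
genes, acc = mgrfe_sketch(X, y)
print("selected genes:", genes, "LOO accuracy:", acc)
```

The fitness key orders individuals by accuracy first and subset size second, which mirrors the stated goal of a gene combination with minimal size and maximal information. A real run on microarray data would replace the toy classifier and the synthetic data with a proper classifier and cross-validation.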
