Hybrid gene selection approach using XGBoost and multi-objective genetic algorithm for cancer classification

Microarray gene expression data typically contain a large number of genes but only a small number of samples. Because only a few of these genes are relevant to cancer, gene selection is a significant challenge. We therefore propose a two-stage gene selection approach, XGBoost-MOGA, which combines extreme gradient boosting (XGBoost) with a multi-objective optimization genetic algorithm for cancer classification on microarray datasets. In the first stage, genes are ranked by an ensemble-based feature selection method built on XGBoost. This stage effectively removes irrelevant genes and yields a group of the genes most relevant to the class. In the second stage, XGBoost-MOGA uses a multi-objective optimization genetic algorithm to search this group for an optimal gene subset. We performed comprehensive experiments comparing XGBoost-MOGA with other state-of-the-art feature selection methods, using two well-known classifiers on 14 publicly available microarray expression datasets. The experimental results show that XGBoost-MOGA yields significantly better results than previous state-of-the-art algorithms on several evaluation criteria, including accuracy, F-score, precision, and recall.
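The two-stage idea can be illustrated with a minimal, standard-library-only Python sketch. It is not the authors' implementation: the abstract does not specify the objectives, operators, or parameters, so everything here is an assumption. The importance scores are hypothetical stand-ins for an XGBoost ranking, a leave-one-out nearest-centroid classifier stands in for the real classifier, and the two objectives (maximize accuracy, minimize subset size) are a common choice rather than the paper's confirmed one.

```python
import random

def top_k_genes(importances, k):
    """Stage 1: keep the k highest-scoring genes (scores assumed to come from XGBoost)."""
    order = sorted(range(len(importances)), key=lambda g: importances[g], reverse=True)
    return order[:k]

def loo_accuracy(X, y, genes):
    """Leave-one-out accuracy of a nearest-centroid classifier (stand-in classifier)."""
    correct = 0
    for i in range(len(X)):
        centroids = {}
        for label in set(y):
            rows = [X[j] for j in range(len(X)) if j != i and y[j] == label]
            if rows:
                centroids[label] = [sum(r[g] for r in rows) / len(rows) for g in genes]
        pred = min(centroids, key=lambda c: sum(
            (X[i][g] - m) ** 2 for g, m in zip(genes, centroids[c])))
        correct += pred == y[i]
    return correct / len(X)

def dominates(a, b):
    """a, b are (accuracy, n_genes); higher accuracy and fewer genes are better."""
    return a[0] >= b[0] and a[1] <= b[1] and a != b

def pareto_front(scored):
    """Keep the (mask, objectives) pairs that no other pair dominates."""
    return [s for s in scored if not any(dominates(o[1], s[1]) for o in scored)]

def moga(X, y, candidates, pop_size=10, generations=10, p_mut=0.1, seed=0):
    """Stage 2: toy multi-objective GA over bit masks of the candidate genes."""
    rng = random.Random(seed)

    def evaluate(mask):
        genes = [g for g, bit in zip(candidates, mask) if bit]
        if not genes:  # penalize the empty subset so it is always dominated
            return (0.0, len(candidates) + 1)
        return (loo_accuracy(X, y, genes), len(genes))

    pop = [tuple(rng.randint(0, 1) for _ in candidates) for _ in range(pop_size)]
    for _ in range(generations):
        children = []
        for _ in range(pop_size):
            p1, p2 = rng.choice(pop), rng.choice(pop)
            cut = rng.randrange(1, len(candidates))        # one-point crossover
            child = p1[:cut] + p2[cut:]
            children.append(tuple(b ^ (rng.random() < p_mut) for b in child))  # bit flips
        scored = [(m, evaluate(m)) for m in sorted(set(pop + children))]
        front = pareto_front(scored)                       # non-dominated survive first
        rest = sorted((s for s in scored if s not in front),
                      key=lambda s: (-s[1][0], s[1][1]))
        pop = [m for m, _ in (front + rest)[:pop_size]]
    final = pareto_front([(m, evaluate(m)) for m in pop])
    return [([g for g, bit in zip(candidates, m) if bit], obj) for m, obj in final]

# Synthetic demo: 12 samples, 8 genes, only gene 0 carries class signal.
data_rng = random.Random(1)
y = [0] * 6 + [1] * 6
X = [[3 * label + data_rng.gauss(0, 0.5)] + [data_rng.gauss(0, 1) for _ in range(7)]
     for label in y]
importances = [0.5, 0.2, 0.1, 0.05, 0.05, 0.04, 0.03, 0.03]  # hypothetical scores
candidates = top_k_genes(importances, 4)
front = moga(X, y, candidates)
for genes, (acc, n_genes) in front:
    print(genes, round(acc, 3), n_genes)
```

In practice, the hypothetical `importances` list would be replaced by scores from a fitted XGBoost model, and the nearest-centroid stand-in by whichever classifier is being evaluated. The function returns a Pareto front rather than a single subset, so the final choice between accuracy and subset size remains with the user.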
