A balanced iterative random forest for gene selection from microarray data

Background: The wealth of gene expression values generated by high-throughput microarray technologies leads to complex, high-dimensional datasets. Moreover, many cohorts have imbalanced classes, where the number of patients belonging to each class is not the same. From this kind of dataset, biologists need to identify a small number of informative genes that can be used as biomarkers for a disease.

Results: This paper introduces a Balanced Iterative Random Forest (BIRF) algorithm to select the most relevant genes for a disease from imbalanced high-throughput gene expression microarray data. BIRF is applied to four cancer microarray datasets: a childhood leukaemia dataset collected from The Children’s Hospital at Westmead, which is the main target of this paper; NCI 60; a colon dataset; and a lung cancer dataset. The results obtained by BIRF are compared with those of Support Vector Machine-Recursive Feature Elimination (SVM-RFE), Multi-class SVM-RFE (MSVM-RFE), Random Forest (RF) and Naive Bayes (NB) classifiers. BIRF outperforms these state-of-the-art methods, especially on imbalanced datasets. Experiments on the childhood leukaemia dataset show that BIRF achieves 7% to 12% better accuracy than MSVM-RFE and is able to predict patients in the minor class. The informative biomarkers selected by BIRF were validated by repeating the training experiments three times to see whether they are globally informative or just selected by chance. The results show that 64% of the top genes consistently appear in the three lists, and the top 20 genes remain near the top in the other three lists.

Conclusion: The BIRF algorithm is an appropriate choice for selecting genes from imbalanced high-throughput gene expression microarray data. BIRF outperforms the state-of-the-art methods, especially in its ability to handle class-imbalanced data. Moreover, the analysis of the selected genes provides a way to distinguish between genes that are predictive and those that only appear to be predictive.
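
The validation step described in the Results, checking whether top-ranked genes recur across repeated training runs, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `consistent_fraction` and `worst_rank_elsewhere`, the gene labels and the toy rankings are all hypothetical, and the abstract does not specify how the overlap was actually computed.

```python
from typing import Dict, Sequence


def consistent_fraction(ranked_lists: Sequence[Sequence[str]], top_k: int) -> float:
    """Fraction of the top_k genes that appear in the top_k of every ranked list.

    Assumption: each list is ordered from most to least important gene.
    """
    if not ranked_lists or top_k <= 0:
        raise ValueError("need at least one ranked list and a positive top_k")
    top_sets = [set(genes[:top_k]) for genes in ranked_lists]
    shared = set.intersection(*top_sets)
    return len(shared) / top_k


def worst_rank_elsewhere(ranked_lists: Sequence[Sequence[str]], top_n: int) -> Dict[str, int]:
    """For the top_n genes of the first list, report their worst 1-based rank in the other lists.

    A gene missing from another list is given the rank len(list) + 1.
    """
    reference, others = ranked_lists[0], ranked_lists[1:]
    result: Dict[str, int] = {}
    for gene in reference[:top_n]:
        ranks = [
            other.index(gene) + 1 if gene in other else len(other) + 1
            for other in others
        ]
        result[gene] = max(ranks) if ranks else 0
    return result


# Toy rankings from three hypothetical training runs (made-up gene labels).
run_a = ["G1", "G2", "G3", "G4", "G5"]
run_b = ["G2", "G1", "G5", "G3", "G6"]
run_c = ["G1", "G3", "G2", "G7", "G4"]

print(consistent_fraction([run_a, run_b, run_c], top_k=4))
print(worst_rank_elsewhere([run_a, run_b, run_c], top_n=2))
```

A high consistent fraction, together with small worst ranks for the leading genes, would suggest that a gene list is stable across runs rather than an artefact of a single training sample.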
