An ensemble machine learning model based on multiple filtering and supervised attribute clustering algorithm for classifying cancer samples

Background Machine learning is a branch of machine intelligence that learns from data and detects inherent patterns in large, complex datasets. Because of this capability, machine learning techniques are widely used in medical applications, especially those involving large-scale genomic and proteomic data. Cancer classification based on bio-molecular profiling data is an important topic in medical applications because it improves the diagnostic accuracy of cancer and supports successful treatment outcomes. Hence, machine learning techniques are widely used in cancer detection and prognosis.

Methods In this article, a new ensemble machine learning classification model, the Multiple Filtering and Supervised Attribute Clustering algorithm based Ensemble Classification model (MFSAC-EC), is proposed to handle the class imbalance problem and the high dimensionality of microarray datasets. The model first generates a number of bootstrapped datasets from the original training data, applying an oversampling procedure to handle the class imbalance problem. The proposed MFSAC method is then applied to each bootstrapped dataset to generate a sub-dataset that contains a subset of the most relevant (informative) attributes of the original dataset. MFSAC is a feature selection technique that combines multiple filters with a new supervised attribute clustering algorithm. A base classifier is then constructed separately for every sub-dataset, and the predictions of these base classifiers are combined by majority voting to form the MFSAC-based ensemble classifier. In addition, the most informative attributes are selected as important features based on their frequency of occurrence across the sub-datasets.

Results To assess the performance of the proposed MFSAC-EC model, it is applied to different high-dimensional microarray gene expression datasets for cancer sample classification. The proposed model is compared with well-known existing models to establish its effectiveness. The experimental results show that the generalization performance (testing accuracy) of the proposed classifier is significantly better than that of the other well-known existing models. The proposed model also identifies many important attributes (biomarker genes).
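The pipeline described in the Methods paragraph can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' MFSAC-EC implementation: a single Fisher-like score stands in for the multiple filters and the supervised attribute clustering step, a nearest-centroid rule stands in for the base classifier, and all function names, parameters, and the toy data are hypothetical.

```python
import random
from collections import Counter
from statistics import mean, pstdev


def bootstrap_with_oversampling(X, y, rng):
    """Resample every class with replacement up to the largest class size."""
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)
    target = max(len(rows) for rows in by_class.values())
    Xb, yb = [], []
    for label, rows in by_class.items():
        for _ in range(target):
            Xb.append(rng.choice(rows))
            yb.append(label)
    return Xb, yb


def filter_score(values, labels):
    """Placeholder filter (assumption): spread of class means over within-class spread."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    centers = [mean(g) for g in groups.values()]
    within = mean(pstdev(g) for g in groups.values())
    return pstdev(centers) / (within + 1e-9)


def select_features(X, y, k):
    """Stand-in for MFSAC: keep the k attributes with the highest filter score."""
    n_features = len(X[0])
    scores = [filter_score([row[j] for row in X], y) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)[:k]


def fit_centroids(X, y, features):
    """Base classifier (assumption): one centroid per class on the selected attributes."""
    groups = {}
    for row, label in zip(X, y):
        groups.setdefault(label, []).append([row[j] for j in features])
    return {label: [mean(col) for col in zip(*rows)] for label, rows in groups.items()}


def predict_one(centroids, features, row):
    point = [row[j] for j in features]
    return min(
        centroids,
        key=lambda c: sum((a - b) ** 2 for a, b in zip(point, centroids[c])),
    )


def fit_ensemble(X, y, n_models=15, k=2, seed=0):
    """Train one base classifier per bootstrapped sub-dataset and count attribute usage."""
    rng = random.Random(seed)
    models, usage = [], Counter()
    for _ in range(n_models):
        Xb, yb = bootstrap_with_oversampling(X, y, rng)
        features = select_features(Xb, yb, k)
        usage.update(features)
        models.append((features, fit_centroids(Xb, yb, features)))
    return models, usage


def predict(models, row):
    """Combine the base classifiers' predictions by majority voting."""
    votes = Counter(predict_one(c, f, row) for f, c in models)
    return votes.most_common(1)[0][0]


def make_toy_data(seed=1):
    """Imbalanced toy data: attributes 0 and 1 are informative, the other four are noise."""
    rng = random.Random(seed)
    X, y = [], []
    for label, n, shift in (("normal", 12, 0.0), ("tumor", 4, 5.0)):
        for _ in range(n):
            noise = [rng.gauss(0, 1) for _ in range(4)]
            X.append([shift + rng.gauss(0, 0.3), -shift + rng.gauss(0, 0.3)] + noise)
            y.append(label)
    return X, y


X, y = make_toy_data()
models, usage = fit_ensemble(X, y)
print("most frequently selected attributes:", usage.most_common(2))
print("prediction:", predict(models, [5.0, -5.0, 0.0, 0.0, 0.0, 0.0]))
```

The attribute counter plays the role of the frequency-of-occurrence ranking mentioned in the abstract: attributes that are selected in many sub-datasets are reported as the most informative ones.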
