A Novel Hybrid Feature Selection and Ensemble Learning Framework for Unbalanced Cancer Data Diagnosis With Transcriptome and Functional Proteomic

The high dimensionality, high redundancy, and class imbalance of cancer multi-omics data are the main challenges for cancer diagnosis. Existing studies have neglected the role of functional proteomics in the occurrence and development of cancer. In this study, a novel hybrid feature selection and ensemble learning framework, referred to as the three-stage feature selection and twice-competitional ensemble learning method (TSFS-TCEM), is proposed for cancer diagnosis. First, we combine transcriptome and functional proteomics data to construct a multi-omics dataset on breast cancer; this is the first time these combined biological data have been applied to breast cancer diagnosis. Second, the proposed method introduces multiple models during both feature selection and diagnostic model construction. The three-stage feature selection integrates features from the different data types, and the twice-competitional ensemble learning framework addresses the class imbalance problem from which a single classifier suffers. TSFS-TCEM achieves a diagnostic accuracy of 99.64%, outperforming all compared methods. In addition, under 5-fold cross-validation, the sensitivity, specificity, and F-measure of the method are all above 99.63%.
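The abstract reports sensitivity, specificity, and F-measure alongside accuracy, which matters for imbalanced data because accuracy alone can hide poor performance on the minority class. The following is a minimal sketch of those metrics under the standard binary confusion-matrix definitions; the abstract does not state the exact formulas used, so the definitions, the function names, and the toy counts below are assumptions for illustration only and are not results from the study.

```python
def binary_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, f_measure) from confusion counts.

    Assumes the standard definitions; a ratio with a zero denominator
    is reported as 0.0 rather than raising an error.
    """
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # true negative rate
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + sensitivity
    # F-measure: harmonic mean of precision and sensitivity (recall)
    f_measure = 2 * precision * sensitivity / denom if denom else 0.0
    return sensitivity, specificity, f_measure


def accuracy(tp, fp, tn, fn):
    """Fraction of all samples that were classified correctly."""
    total = tp + fp + tn + fn
    return (tp + tn) / total if total else 0.0


# Hypothetical imbalanced case: 90 tumor samples, 10 normal samples,
# and a degenerate classifier that labels every sample as tumor.
counts = dict(tp=90, fp=10, tn=0, fn=0)
acc = accuracy(**counts)
sens, spec, f1 = binary_metrics(**counts)
```

In this toy case the accuracy is 0.9 while the specificity is 0.0, which is why a class-imbalanced benchmark is usually judged on all of these metrics together rather than on accuracy alone.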
