Ensemble Methods with Voting Protocols Exhibit Superior Performance for Predicting Cancer Clinical Endpoints and Providing More Complete Coverage of Disease-Related Genes

In genetic data modeling, building predictive models from a limited number of samples, especially when the sample count is well below the number of attributes, is difficult because of the enormous number of genes detected by a sequencing platform. In addition, many studies use machine learning methods to evaluate genetic datasets and identify potential disease-related genes and drug targets, but to the best of our knowledge, the information associated with the selected gene sets has not been thoroughly elucidated in previous studies. To identify a relatively stable scheme for modeling gene datasets with limited samples and to reveal the information that these datasets contain, the present study first evaluated the performance of a series of modeling approaches for predicting clinical endpoints of cancer and then integrated the results using various voting protocols. As a result, we propose a relatively stable scheme that uses a set of methods with an ensemble algorithm. Our findings indicate that ensemble methodologies are more reliable than single machine learning algorithms, both for predicting cancer prognoses and for evaluating gene function. The ensemble methodologies also provide more complete coverage of relevant genes, which can facilitate the exploration of cancer mechanisms and the identification of potential drug targets.
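As an illustration of how predictions from several models might be integrated by voting, the following is a minimal sketch in Python. It is not the study's implementation: the abstract does not specify the voting protocols, so plain majority voting is assumed here, and the function names `majority_vote` and `genes_by_votes`, the tie-breaking rule, and the `min_votes` threshold are all hypothetical.

```python
from collections import Counter
from typing import Sequence


def majority_vote(predictions: Sequence[Sequence[int]]) -> list[int]:
    """Combine class labels from several models by simple majority.

    predictions[m][i] is the label that model m assigns to sample i.
    Assumption: ties are broken toward the smallest label; the source
    does not say how ties are handled.
    """
    n_samples = len(predictions[0])
    voted = []
    for i in range(n_samples):
        votes = Counter(model_preds[i] for model_preds in predictions)
        top = max(votes.values())
        voted.append(min(label for label, count in votes.items() if count == top))
    return voted


def genes_by_votes(selected: Sequence[set[str]], min_votes: int = 1) -> set[str]:
    """Return genes chosen by at least `min_votes` of the models.

    Hypothetical helper: min_votes=1 gives the union of all gene sets,
    which is one way an ensemble could cover more genes than any
    single model.
    """
    counts = Counter(gene for gene_set in selected for gene in gene_set)
    return {gene for gene, count in counts.items() if count >= min_votes}
```

With `min_votes=1`, `genes_by_votes` returns every gene selected by any model; raising the threshold trades coverage for agreement among models.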
