Finding disagreement pathway signatures and constructing an ensemble model for cancer classification

With advances in high-throughput molecular profiling techniques, molecular-level cancer classification has become a relatively routine research procedure. However, in gene expression studies the number of genes typically far exceeds the number of samples. Most existing gene selection methods rely on statistics and machine learning alone, overlooking relevant biological principles and knowledge when working with biological data. Here, we propose a robust ensemble learning paradigm that incorporates information from multiple pathways to classify cancer. We compare the proposed method with other methods, such as Elastic SCAD and PPDMF, and evaluate their classification performance. The results show that the proposed method performs better on most metrics and is robust. We further investigate the biological mechanisms of the ensemble feature genes. The results demonstrate that the ensemble feature genes are associated with drug targets and clinically relevant cancer. In addition, functional annotation identifies some core biological pathways and biological processes underlying clinically relevant phenotypes. Overall, our research provides a new perspective for further study of the molecular activities and manifestations of cancer.
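The abstract does not specify how the pathway-level models are built or combined, so the following is only a minimal sketch of the general idea of a pathway-based ensemble, not the authors' method. It assumes one base classifier per pathway gene set (here a simple nearest-centroid rule, chosen for brevity) and an unweighted majority vote across pathways. The pathway names, gene indices, and expression values are hypothetical toy data.

```python
from collections import Counter

def centroid(rows):
    """Mean vector of a list of equal-length numeric rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def fit_pathway_model(X, y, gene_idx):
    """Fit a nearest-centroid model using only one pathway's gene columns."""
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append([row[i] for i in gene_idx])
    return gene_idx, {label: centroid(rows) for label, rows in by_class.items()}

def predict_pathway(model, row):
    """Predict the class whose centroid is closest in this pathway's gene space."""
    gene_idx, centroids = model
    sub = [row[i] for i in gene_idx]

    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(sub, c))

    return min(centroids, key=lambda label: sq_dist(centroids[label]))

def predict_ensemble(models, row):
    """Combine pathway-level predictions by unweighted majority vote (an assumption)."""
    votes = Counter(predict_pathway(m, row) for m in models)
    return votes.most_common(1)[0][0]

# Hypothetical toy expression matrix: 4 samples x 4 genes.
X = [
    [5.0, 5.1, 0.2, 0.1],
    [4.8, 5.3, 0.1, 0.3],
    [0.2, 0.1, 4.9, 5.2],
    [0.3, 0.2, 5.1, 4.7],
]
y = ["tumor", "tumor", "normal", "normal"]

# Hypothetical pathways, each given as a list of gene column indices.
pathways = {"pathway_A": [0, 1], "pathway_B": [2, 3], "pathway_C": [0, 3]}

models = [fit_pathway_model(X, y, idx) for idx in pathways.values()]
prediction = predict_ensemble(models, [4.9, 5.0, 0.2, 0.2])
print(prediction)
```

Restricting each base model to a single pathway keeps the per-model feature count small relative to the sample size, which is the high-dimensionality problem the abstract raises. Using an odd number of pathway models avoids tied votes in the two-class case.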
