Robust edge-based biomarker discovery improves prediction of breast cancer metastasis

Background: The abundance of molecular profiling data from breast cancer tissues has prompted active research on molecular marker-based early diagnosis of metastasis. Recently, there has been surging interest in combining gene expression with gene networks, such as protein-protein interaction (PPI) networks, gene co-expression (CE) networks, and pathway information, to identify robust and accurate biomarkers for metastasis prediction, reflecting the common belief that cancer is a systems biology disease. However, controversy exists in the literature regarding whether network markers are indeed better features than genes alone for predicting, as well as understanding, metastasis. We believe many of the existing results may have been biased by overly complicated prediction algorithms, unfair evaluation, and a lack of rigorous statistics. In this study, we propose a simple approach that uses network edges as features, based on two types of networks, and compare their predictive power using three classification algorithms and a rigorous statistical procedure on one of the largest datasets available. To detect biomarkers that are significant for the prediction and to compare the robustness of different feature types, we propose a novel, unbiased procedure for measuring feature importance that eliminates potential bias from factors such as differences in sample size, number of features, and class distribution.

Results: Experimental results reveal that edge-based feature types consistently outperformed the gene-based feature type in random forest and logistic regression models under all performance evaluation metrics, whereas the edge-based support vector machine (SVM) model was less accurate, owing to the larger number of edge features compared to gene features and the lack of feature selection in the SVM model. Experimental results also show that edge features are much more robust than gene features, and that the top biomarkers from edge feature types are more significantly enriched in biological processes well known to be related to breast cancer metastasis.

Conclusions: Overall, this study validates the utility of edge features as biomarkers and also highlights the importance of carefully designed experimental procedures for achieving statistically reliable comparison results.
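To make the idea of "network edges as features" concrete, the following is a minimal sketch rather than the authors' method. The abstract does not state how an edge's value is computed, so the definition used here (the product of the z-scored expression of the two endpoint genes in each sample) is an assumption, and the names `zscore`, `edge_features`, and the toy genes `G1` to `G4` are hypothetical.

```python
from statistics import mean, pstdev


def zscore(values):
    """Standardize one gene's expression values across samples."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A gene with constant expression carries no signal.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def edge_features(expression, edges):
    """Build one per-sample feature vector for each network edge.

    expression: dict mapping gene name -> list of per-sample expression values
    edges: iterable of (gene_a, gene_b) pairs, e.g. from a PPI or CE network

    Assumption (not taken from the paper): the value of an edge in a sample
    is the product of the z-scored expression of its two endpoint genes.
    """
    z = {gene: zscore(values) for gene, values in expression.items()}
    features = {}
    for a, b in edges:
        if a in z and b in z:  # skip edges whose genes were not profiled
            features[(a, b)] = [x * y for x, y in zip(z[a], z[b])]
    return features


# Toy data: three samples, three profiled genes, one edge to an unprofiled gene.
expression = {
    "G1": [1.0, 2.0, 3.0],
    "G2": [3.0, 2.0, 1.0],
    "G3": [5.0, 5.0, 5.0],
}
edges = [("G1", "G2"), ("G1", "G3"), ("G1", "G4")]
feats = edge_features(expression, edges)
```

Under this assumed definition, each edge becomes a column in the feature matrix in place of (or alongside) the individual genes, and the resulting table could be passed to any of the classifier families named in the abstract. Because a network typically has many more edges than genes, the feature count grows accordingly, which is consistent with the abstract's remark about models that lack feature selection.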
