The Impact of Pathway Database Choice on Statistical Enrichment Analysis and Predictive Modeling

Background: Pathway-centric approaches are widely used to interpret and contextualize -omics data. However, databases represent the same biological pathway differently, which may lead to different results in statistical enrichment analysis and in predictive models used for precision medicine.

Results: We performed an in-depth benchmark of the impact of pathway database choice on statistical enrichment analysis and predictive modeling. We analyzed five cancer datasets using three major pathway databases and developed an approach to merge several databases into a single integrative database, MPath. Our results show that equivalent pathways from different databases yield disparate results in statistical enrichment analysis. Moreover, we observed a significant, dataset-dependent impact on the performance of machine learning models across different prediction tasks. In some cases, MPath significantly improved prediction performance and reduced its variance. At the same time, MPath yielded more consistent and biologically plausible results in the statistical enrichment analyses. Finally, we implemented a software package that makes our comparative analysis fully reproducible with these and additional databases and that facilitates future updates of our integrative pathway resource.

Conclusion: This benchmarking study demonstrates that pathway database choice can influence the results of statistical enrichment analysis and predictive modeling. Therefore, we recommend using multiple pathway databases or an integrative database.
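The claim that equivalent pathways from different databases can give disparate enrichment results can be illustrated with a minimal sketch. It assumes a one-sided hypergeometric (over-representation) test and a simple set union as the merging step; neither is confirmed by the abstract as the method used in the study. The gene identifiers, the two database definitions, and the function names are hypothetical toy values chosen only for illustration.

```python
from math import comb


def hypergeom_pvalue(universe_size, pathway_size, n_selected, n_overlap):
    """Upper-tail probability P(X >= n_overlap) of the hypergeometric distribution."""
    total = comb(universe_size, n_selected)
    upper = min(pathway_size, n_selected)
    tail = sum(
        comb(pathway_size, k) * comb(universe_size - pathway_size, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


def enrichment_pvalue(pathway_genes, selected_genes, universe):
    """Over-representation p-value of one pathway for a list of selected genes."""
    pathway = set(pathway_genes) & universe
    selected = set(selected_genes) & universe
    overlap = pathway & selected
    return hypergeom_pvalue(len(universe), len(pathway), len(selected), len(overlap))


def merge_pathways(*gene_sets):
    """Assumed merging rule: union of the gene sets of equivalent pathways."""
    merged = set()
    for genes in gene_sets:
        merged |= set(genes)
    return merged


# Hypothetical toy data: 100 background genes, 10 of them selected.
universe = {f"G{i}" for i in range(100)}
selected = {f"G{i}" for i in range(10)}

# Two hypothetical representations of the "same" pathway.
pathway_db_a = {f"G{i}" for i in range(5)} | {f"G{i}" for i in range(50, 55)}
pathway_db_b = {"G3", "G4"} | {f"G{i}" for i in range(60, 78)}
pathway_merged = merge_pathways(pathway_db_a, pathway_db_b)

p_a = enrichment_pvalue(pathway_db_a, selected, universe)
p_b = enrichment_pvalue(pathway_db_b, selected, universe)
p_merged = enrichment_pvalue(pathway_merged, selected, universe)

for name, p in [("database A", p_a), ("database B", p_b), ("merged", p_merged)]:
    print(f"{name}: p = {p:.4g}")
```

In this toy setting the same selected genes produce a much smaller p-value for the database A definition than for the database B definition, solely because the two gene sets differ. A real analysis would also correct for multiple testing across pathways.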
