Integrating biological knowledge and gene expression data using pathway-guided random forests: a benchmarking study

Abstract

Motivation: High-throughput technologies allow comprehensive characterization of individuals on many molecular levels. However, training computational models to predict disease status based on omics data is challenging. A promising solution is the integration of external knowledge about structural and functional relationships into the modeling process. We compared four published random forest-based approaches using two simulation studies and nine experimental datasets.

Results: The self-sufficient prediction error approach should be applied when large numbers of relevant pathways are expected. The competing methods hunting and learner of functional enrichment should be used when low numbers of relevant pathways are expected or the most strongly associated pathways are of interest. The hybrid approach synthetic features is not recommended because of its high false discovery rate.

Availability and implementation: An R package providing functions for data analysis and simulation is available at GitHub (https://github.com/szymczak-lab/PathwayGuidedRF). An accompanying R data package (https://github.com/szymczak-lab/DataPathwayGuidedRF) stores the processed and quality controlled experimental datasets downloaded from Gene Expression Omnibus (GEO).

Supplementary information: Supplementary data are available at Bioinformatics online.
