Predictive modelling using pathway scores: robustness and significance of pathway collections

Background: Transcriptomic data is often used to build statistical models that predict a given phenotype, such as disease status. Genes work together in pathways, and it is widely thought that pathway representations are more robust to noise in gene expression levels. We aimed to test this hypothesis by constructing models based either on genes alone or on sample-specific scores for each pathway, thus transforming the data to a ‘pathway space’. We progressively degraded the raw data by adding noise and examined how well the models maintained their predictive performance.

Results: Models in the pathway space indeed had higher predictive robustness than models in the gene space. This result was independent of the workflow, parameters, classifier and data set used. Surprisingly, randomised pathway mappings produced models of similar accuracy and robustness to true mappings, suggesting that the success of pathway space models is not conferred by the specific definitions of the pathways. However, predictive models built on the true pathway mappings led to prediction rules with fewer influential pathways than those built on randomised pathways. The extent of this effect was used to differentiate pathway collections from a variety of widely used pathway databases.

Conclusions: Prediction models based on pathway scores are more robust to degradation of gene expression information than the equivalent models based on ungrouped genes. While models based on true pathway scores are not more robust or accurate than those based on randomised pathways, true pathways produced simpler prediction rules that emphasize a smaller number of pathways.
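
The comparison described above can be illustrated with a minimal sketch on synthetic data. The abstract does not specify the scoring method, noise model or classifier, so everything here is an assumption made for illustration: pathway scores are taken as the mean expression of member genes, degradation is additive Gaussian noise, randomised mappings keep pathway sizes but draw members at random, and the classifier is a simple nearest-centroid rule. All function names are hypothetical.

```python
import numpy as np


def pathway_scores(expr, pathways):
    """Map a samples x genes matrix to samples x pathways.

    Assumption: a pathway score is the mean expression of its member genes.
    """
    return np.column_stack([expr[:, idx].mean(axis=1) for idx in pathways])


def randomise_pathways(pathways, n_genes, rng):
    """Keep each pathway's size but draw its member genes at random."""
    return [rng.choice(n_genes, size=len(idx), replace=False) for idx in pathways]


def add_noise(expr, sigma, rng):
    """Degrade expression values with additive Gaussian noise (assumed noise model)."""
    return expr + rng.normal(0.0, sigma, size=expr.shape)


def nearest_centroid_accuracy(train_x, train_y, test_x, test_y):
    """Fit class centroids on training data and report accuracy on test data."""
    centroids = np.stack([train_x[train_y == c].mean(axis=0) for c in (0, 1)])
    dists = ((test_x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return float((dists.argmin(axis=1) == test_y).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_samples, n_genes, pathway_size = 100, 200, 10

    # Synthetic two-class data: the first 20 genes are shifted in class 1.
    labels = np.arange(n_samples) % 2
    expr = rng.normal(size=(n_samples, n_genes))
    expr[labels == 1, :20] += 1.0

    # "True" pathways are consecutive blocks of genes; random ones match in size.
    true_pw = [np.arange(s, s + pathway_size) for s in range(0, n_genes, pathway_size)]
    rand_pw = randomise_pathways(true_pw, n_genes, rng)

    train, test = np.arange(0, 50), np.arange(50, 100)
    for sigma in (0.0, 2.0, 4.0):
        noisy = add_noise(expr, sigma, rng)
        for name, feats in (
            ("genes", noisy),
            ("true pathways", pathway_scores(noisy, true_pw)),
            ("random pathways", pathway_scores(noisy, rand_pw)),
        ):
            acc = nearest_centroid_accuracy(
                feats[train], labels[train], feats[test], labels[test]
            )
            print(f"sigma={sigma:.1f}  {name:<16} accuracy={acc:.2f}")
```

The sketch only shows the shape of the experiment: degrade the data, score it in each feature space, and track accuracy as the noise level grows. It is not meant to reproduce the reported results, which depend on real data sets and curated pathway collections.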
